=Paper=
{{Paper
|id=Vol-1733/paper-01
|storemode=property
|title=Optimization Techniques for 2-hop Labeling of Dynamic Directed Acyclic Graphs
|pdfUrl=https://ceur-ws.org/Vol-1733/paper-01.pdf
|volume=Vol-1733
|authors=Jinhyun Ahn
|dblpUrl=https://dblp.org/rec/conf/semweb/Ahn16
}}
==Optimization Techniques for 2-hop Labeling of Dynamic Directed Acyclic Graphs==
Optimization Techniques for 2-hop Labeling of Dynamic Directed Acyclic Graphs Jinhyun Ahn Biomedical Knowledge Engineering Laboratory and Dental Research Institute Seoul National University jhahncs@snu.ac.kr Abstract. Graph databases are increasingly used for RDF(Resource Description Framework) data management. Mining reachability relation- ships between resources is one of important building blocks for graph databases. Reachability relationships in a graph and its corresponding directed acyclic graph(DAG) are equivalent when we focus on reachabil- ity alone, which allows to focus on DAGs in this thesis. Diverse label- ing schemes have been proposed to efficiently determine the reachability of DAGs. We focus on a state-of-the-art 2-hop labeling scheme that is based on a permutation of vertices to achieve a linear index size and reduce on-line searches that are required when the reachability cannot be answered by 2-hop labels only. We observed that the approach can be improved in three different ways; 1) space-efficiency guarantee the minimized index size without randomness 2) update-efficiency update labels efficiently when graphs changes 3) parallelization labeling should be cluster-based, and solved in a distributed fashion. In these regards, this PhD thesis proposes optimization techniques that address these is- sues. In this paper in particular, a way of reducing the 2-hop label size is proposed with preliminary results on real-world DAG datasets. In ad- dition, we will discuss the feasibilities of the other issues based on our on-going works. 1 Problem statement A 2-hop labeling scheme of a DAG is to label each vertex v with a pair (Lout (v), Lin (v)), where Lout (v) is a set of vertices that v can reach and Lin (v) is a set of vertices reachable from v [4]. In this thesis, we focus on a state-of-the-art variation of 2-hop labeling [14], where a permutation of vertices is used to allow Lout (v) and Lin (v) to keep at most k reachable vertices, which probabilistically guarantees reduction in on-line Depth-First-Search(DFS), a condition required when the reachability cannot be answered by these labels only. We briefly review the labeling scheme as follows. 2-hop Labeling based on Independent Permutation (IP) Let G(V, E) be a DAG with V as a set of vertices and E as a set of edges pairing two vertices. u ↝ v is used to denote that u can reach v, and u ↝̸ v, otherwise. A permutation of V is a bijection denoted as σ ∶ V Ð→ V . Given G and a positive integer k, IP ࢜ ߪ ሺ࢜ሻ ࡸ࢛࢚ ሺ࢜ሻ ࡸ ሺ࢜ሻ ࢜ ߪ ሺ࢜ሻ ࡸ࢛࢚ ሺ࢜ሻ ࡸ ሺ࢜ሻ ܫ ݁ݖ݅ݏఙଷభ ܸ ൌ ͳʹ ܫ ݁ݖ݅ݏఙଷమ ܸ ൌ ͳ͵ͻ Fig. 1: The 2-hop index size of the same graph varies according to σ13 and σ23 . randomly generates σ and then outputs the 2-hop index Iσk ∶ V Ð→ H, where H is a set of pairs (Lout (v), Lin (v)) defined as: Lout (v) = mink {σ(u)∣v ↝ u} Lin (v) = mink {σ(u)∣u ↝ v} , where mink {A} is a sub-set of A, such that Ai < Aj if i < j and ∣mink {A}∣ ≤ k. In other words, mink {A} contains the k smallest integers in A. In the original version of IP, σ is randomly generated using the Knuth shuffle algorithm. We argue that the randomized way of generating a permutation is unable to ensure that the minimized size of index is obtained. Therefore, we define the first problem that generates a space-efficient permutation defined in Definition 1. Definition 1. Space-efficient Permutation Given G(V, E) and a positive integer k, output σ such that 1 argmin size(Iσk (V )) ∶= ∑ space(Lout (v)) + space(Lin (v)) σ ∣V ∣ v∈V , where space(L) is the space requirements for a set L of integers, which can be defined according to the application requirements. An illustrative example that shows the effect of σ on the index size is depicted in Fig. 1, where we assume space(L) = ∑w∈L w to see the index size difference for the very small DAG. Linked Open Data1 (LOD) evolves over time [9]. When a DAG Gt changes into Gt+1 , the straightforward way is to construct the 2-hop index against Gt+1 from scratch. As this is not feasible for massive DAGs in LOD, we define a second problem that generates an update-efficient permutation by which 1) a smaller number of existing labels are modified and 2) a smaller index size is maintained, as defined in Definition 2. Since there is a trade-off between these criteria, weights 1 http://linkeddata.org are used according to requirements. The first factor of the equation in Definition 2 represents the number of vertices in the current version Vv+1 whose labels are not the same as in the previous version in Iσk (v). Definition 2. Update-efficient Permutation Assume we already have Gt (Vt , Et ) and its 2-hop index Iσk (Vt ). Given Gt+1 (Vt+1 , Et+1 ), generate a new permutation σt+1 such that argmin α × ∣{v ∈ Vt ⋂ Vt+1 ∣Iσkt (v) ≠ Iσkt+1 (v)}∣ + β × size(Iσkt+1 (Vt+1 )) σt+1 , where α ≥ 0, β ≥ 0. Meanwhile, existing 2-hop labeling algorithms running on a single machine are not efficient enough to address the above problems for massive DAGs. Our third problem is thus to devise a cluster-based 2-hop labeling algorithm that supports Definition 1 and 2. 2 Relevancy Reachability relationships in a graph and its corresponding DAG are equivalent, since a graph can be transformed into a DAG by condensing every strongly connected component(SCC) into a single vertex and retaining edges between vertices in SCC and the other vertices that are connected to SCC [13]. In terms of reachability relationships, LOD can be viewed as a real-world DAG dataset. A vast amount of knowledge is available in LOD, including social networks, biological networks, traffic networks, and software version management. Protein- protein interaction networks, for example, can be represented in DAGs. One may be interested in mining interactions (i.e., reachability relationships) between two proteins (i.e., vertices) [5]. Regarding RDF triple stores, reachability relationship identification helps process SPARQL queries efficiently [6]. 3 Related Work There are extensive studies on labeling of DAGs for reachability queries pro- cessing [17]. The straightforward way is to pre-compute the edge transitive clo- sures(TC) [11]. Even if this method provides an answer to a reachability query in O(1) time, the index could be huge for even small dense graphs. On-line traver- sal, such as DFS and Bread-First Search(BFS), do not require any index but lead to slow query processing. Various approaches have been made to address trade-offs between the index size and the speed of query processing. Some au- thors have adapted the prime number labeling scheme to DAGs [15]. However, prime numbers quickly grow to large values. Tree Cover [1] labels a vertex v with a compressed TC of the subtree rooted at v. [12] present GRIPP(GRaph Indexing based on Pre- and Postorder numbering) that utilizes spanning trees based on an interval label scheme. GRAIL [16] is based on randomized multi- ple interval labeling. FERRARI [10] labels each vertex with a mixture of exact and approximate reachability intervals, where subsets of intervals are merged into approximate intervals. 2-hop labeling [4] labels each vertex v with a pair (Lout (v), Lin (v)). Recently, a state-of-the-art variation of 2-hop labeling has been proposed in [14] that reports outstanding performance compared to existing approaches. A permutation of vertices is first generated randomly. MinHash is adapted to construct 2-hop labels; at most k reachable vertices are only maintained in 2- hop labels. For pairs that cannot be determined by k reachable vertices only, an on-line search is performed. Some heuristics are also presented to minimize the case that requires exhaustive on-line DFS searches. 4 Research questions The research objective is to optimize the IP labeling [14] in terms of index size, updating cost, and scalability. We want to answer the following research ques- tions (where difficulties regarding the questions are also discussed). 1) Is it possible to deterministically choose a permutation σ of V that best minimizes the index size? Because the number of candidate permutations is n!, a brute-force algorithm is not feasible. One may choose a permutation by sorting V using a topological sort. Smaller numbers are assigned to vertices having more reachable vertices. It can be expected that this method allows Lin (v) to have smaller numbers. However, conversely, Lout (v) would have large numbers, offsetting smaller size of Lin (v). A more sophisticated technique is required. 2) When a new vertex vnew is added to a DAG Gt (Vt , Et ) that has already been labeled by σt , which number should be assigned to σt+1 (vnew )? The simplest way is to make σt+1 (vnew ) = max(σt (Vt )) + 1. However, with this approach, the minimized index size is not always ensured. 3) In order to do 2-hop labeling in a cluster, when vertices are distributed, how can we know reachable vertices from a vertex? Given a set M of machines, we may distribute a subset Vi of V to a machine Mi and then perform labeling independently. The challenge is to obtain Lin (v) and Lout (v) for a vertex v ∈ Vi residing in Mi whereas some o ∈ Lin (v) ⋃ Lout (v) have been distributed to the other machines Mj , such that i ≠ j. 5 Hypotheses Several hypotheses regarding research questions are stated as follows. – If smaller numbers are assigned to σ(v) because v has more edges, then the 2-hop index size is reduced. – The reachability query processing performance increases as index size is re- duced. Table 1: Real-world DAG datasets Name Vertices Edges Average Degree (max.) Median Degree arxiv 6,000 66,707 22.2 (700) 14 go 6,793 13,361 3.9 (71) 3 pubmed 9,000 40,028 8.9 (432) 4 citeseer 340,945 312,282 1.8 (55,758) 1 citpatents 3,774,768 16,518,947 8.8 (793) 6 go-uniprot 6,967,383 34,769,339 10.0 (1,186,280) 4 yago 16,375,031 25,908,132 3.2 (732,895) 1 – If 2-hop labeling is carried out in a cluster, then the index construction and update would be faster than existing single machine based algorithms are. 6 Approach To prove the first hypotheses, we first compute the degree of each of the vertices in V and then sort V in decreasing order by degree. The order is regarded as the desired permutation σ(V ) as follows. σ(v) = i , where v is the i-th element in V degree such that V degree is a sorted set of V , sorted by decreasing degree of v. In other words, the more degrees v has, the smaller the number assigned to σ(v). This is derived by an expectation that if v has many edges, v is likely to become a reachable vertex from another vertex. The overall 2-hop index size could be reduced by making σ(v) smaller. Note that the current approach does not take into account the parameter k. We plan to introduce the parameter k to guarantee a reduced index size for arbitrary k. Regarding a cluster-based algorithm, we are motivated by the approach in [3] that proposed cluster based labeling algorithms for trees. In that work, Mapper performs incomplete labeling and then Reducer completes the labels by referring to the offset table shared by all machine. The offset table is constructed based on information collected from each Mapper, which contains information needed to complete the labels in Reducers. We plan to adapt the idea in a manner that effectively deals with large DAGs in a cluster. 7 Preliminary results We implemented the first proposed approach based on the source codes provided by the authors of IP2 . The proposed approach (denoted as IP adv) is evaluated compared with the original approach (IP) and baseline (IP fix). IP generates a 2 http://www1.se.cuhk.edu.hk/˜hwei/source/IP.zip 30.0 30.0 30.0 25.0 25.0 25.0 20.0 20.0 20.0 15.0 15.0 15.0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 (a) arxiv (b) go (c) pubmed 45.0 45.0 45.0 40.0 35.0 40.0 40.0 30.0 25.0 20.0 35.0 35.0 15.0 10.0 5.0 30.0 30.0 0.0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 (d) citeseer (e) citpatents (f) go-uniprot 25.0 20.0 IP IP_fix 15.0 IP_adv 1 2 3 4 5 6 7 8 9 10 (g) yago Fig. 2: 2-hop index sizes over 10 runs. The Y-axis indicates the index size, where the space requirements is space(L) = ∑w∈L len(w) such that len(w) is the num- ber of digits in w. The X-axis indicates runs. k is set to 5. random permutation, which means that a different index is constructed for each run. IP fix is based on the identity permutation such that σ(v) = j, where v is the j-th element in V . All the experiments were conducted on a machine with a 2.4 GHz CPU and 40 GB of RAM. We downloaded the real-world DAG datasets3 . Statistics of the datasets are listed in Table 1. The index sizes for each dataset are plotted in Fig. 2. The index size by IP adv and IP fix are compared with the index sizes generated by 10 runs of IP. For each run, only IP showed the different index size due to its random nature. IP adv showed the best performance for all datasets. For small datasets with large degrees such as arxiv and pubmed, relatively small improvements are observed. For large datasets, we observed relatively large improvements, except 3 https://code.google.com/archive/p/ferrari-index/downloads for citeseer and yago. It can be seen in Table 1 that these datasets have relatively small median degrees. Since IP adv is based on degrees, too small or too large degrees would not significantly contribute to a reduction in the index size. In future works, we plan to utilize more graph metrics to work better against such datasets. 8 Evaluation plan We plan to collect LOD datasets and then transform them into DAGs by con- densing SCC. Synthetic DAG datasets will also be considered, varying graph metrics such as in/out-degree, diameter, and vertices with no ancestors or de- scendants. In particular, several releases of DBpedia datasets4 will be collected in order to evaluate the performance of updating the index. To examine the pros and cons of our approach in greater details, we will compare it to a number of notable existing approaches such as [10, 16, 18]. Regarding our cluster-based algorithm, as there exist few works on cluster-based 2-hop labeling, we will mod- ify existing cluster-based tree labeling algorithms [3] and consider these as the baseline. 9 Reflections It remains a challenge to apply graph reachability indexing techniques to very large sets of LOD. In this paper, we showed that simple graph metrics can be exploited to reduce the index size. By improving the proposed approach further, it will become more applicable to LOD. Another characteristic of LOD is that it changes over time. We have some experience on scalable RDF change detection [2, 8, 7] that outputs added and removed triples for two given RDF dataset versions. By utilizing this system, it would be possible to update only a portion of an existing 2-hop index while avoiding re-labeling. To perform these algorithms efficiently in a scalable manner, we can utilize distributed computing frameworks such as MapReduce and Spark. Acknowledgement I would like to acknowledge my advisors, Hong-Gee Kim (Seoul National Uni- versity) and Dong-Hyuk Im (Hoseo University). This work was supported by Institute for Information & communications Technology Promotion(IITP) grant funded by the Korea government(MSIP) (No. R0101-16-0054, WiseKB: Big data based self-evolving knowledge base and reasoning platform). 4 http://wiki.dbpedia.org/services-resources/datasets/ previous-releases References 1. Agrawal, R., Borgida, A., Jagadish, H.V.: Efficient management of transitive rela- tionships in large data and knowledge bases, vol. 18. ACM (1989) 2. Ahn, J., Im, D.H., Eom, J.H., Zong, N., Kim, H.G.: G-Diff: A Grouping Algorithm for RDF Change Detection on MapReduce. In: Semantic Technology, pp. 230–235. Springer (2014) 3. Choi, H., Lee, K.H., Lee, Y.J.: Parallel labeling of massive xml data with mapre- duce. The Journal of Supercomputing 67(2), 408–437 (2014) 4. Cohen, E., Halperin, E., Kaplan, H., Zwick, U.: Reachability and distance queries via 2-hop labels. SIAM Journal on Computing 32(5), 1338–1355 (2003) 5. Gabr, H., Kahveci, T.: Signal reachability facilitates characterization of probabilis- tic signaling networks. BMC Bioinformatics 16(Suppl 17), S6 (2015) 6. Gubichev, A., Bedathur, S.J., Seufert, S.: Sparqling kleene: fast property paths in rdf-3x. In: First International Workshop on Graph Data Management Experiences and Systems. p. 14. ACM (2013) 7. Im, D.H., Lee, S.W., Kim, H.J.: Backward inference and pruning for RDF change detection using RDBMS. Journal of Information Science pp. 238–255 (2012) 8. Im, D.H., Lee, S.W., Kim, H.J.: A version management framework for RDF triple stores. International Journal of Software Engineering and Knowledge Engineering 22(01), 85–106 (2012) 9. Käfer, T., Umbrich, J., Hogan, A., Polleres, A.: Towards a dynamic linked data observatory. LDOW at WWW (2012) 10. Seufert, S., Anand, A., Bedathur, S., Weikum, G.: Ferrari: Flexible and efficient reachability range assignment for graph indexing. In: Data Engineering (ICDE), 2013 IEEE 29th International Conference on. pp. 1009–1020. IEEE (2013) 11. Simon, K.: An improved algorithm for transitive closure on acyclic digraphs. The- oretical Computer Science 58(1), 325–346 (1988) 12. Trißl, S., Leser, U.: Fast and practical indexing and querying of very large graphs. In: Proceedings of the 2007 ACM SIGMOD international conference on Manage- ment of data. pp. 845–856. ACM (2007) 13. Veloso, R.R., Cerf, L., Meira Jr, W., Zaki, M.J.: Reachability queries in very large graphs: A fast refined online search approach. In: EDBT. pp. 511–522. Citeseer (2014) 14. Wei, H., Yu, J.X., Lu, C., Jin, R.: Reachability querying: An independent permu- tation labeling approach. Proceedings of the VLDB Endowment 7(12), 1191–1202 (2014) 15. Wu, G., Zhang, K., Liu, C., Li, J.: Adapting prime number labeling scheme for directed acyclic graphs. In: Database Systems for Advanced Applications. pp. 787– 796. Springer (2006) 16. Yıldırım, H., Chaoji, V., Zaki, M.J.: GRAIL: a scalable index for reachability queries in very large graphs. The VLDB JournalThe International Journal on Very Large Data Bases 21(4), 509–534 (2012) 17. Yu, J.X., Cheng, J.: Graph reachability queries: A survey. In: Managing and Mining Graph Data, pp. 181–215. Springer (2010) 18. Zhu, A.D., Lin, W., Wang, S., Xiao, X.: Reachability queries on large dynamic graphs: a total order approach. In: Proceedings of the 2014 ACM SIGMOD inter- national conference on Management of data. pp. 1323–1334. ACM (2014)