=Paper=
{{Paper
|id=Vol-2652/paper11
|storemode=property
|title=Efficient and Scalable Aggregate Computation on Temporal Graphs
|pdfUrl=https://ceur-ws.org/Vol-2652/paper11.pdf
|volume=Vol-2652
|authors=Vincent Le Claire
|dblpUrl=https://dblp.org/rec/conf/vldb/Claire20
}}
==Efficient and Scalable Aggregate Computation on Temporal Graphs==
Efficient and Scalable Aggregate Computation on Temporal Graphs Vincent Le Claire supervised by Prof. Dr. Peter Fischer University of Augsburg Augsburg, Germany vincent.leclaire@informatik.uni-augsburg.de ABSTRACT a smaller set of meaningful metrics, as often seen in, e.g., Many applications such as social networks generate big vol- business intelligence. umes of graph data that has additional temporal information In the context of aggregates in temporal graphs, consider or changes rapidly, which has lead to a significant amount of the following example: In a social network, users can be research on temporal graphs. Existing and ongoing work on grouped into several communities that have stronger bounds temporal graphs has focused on path problems and graph inside than to the outside. A query for this use case might databases in general. Aggregations, which are very common be: “For the most connected member of each community, for relational data, are just as insightful for temporal graph how much did the number of friends change over the progress data, but need to be computed efficiently and scalable. To of every calendar week of last year?” That query has sev- aggreagte efficiently, useful operations on graphs need to eral complementary aspects: (a) It accesses the graph in be selected, the composition of aggregation functions needs substructures of varying amount (nodes, neighbours, com- to be investigated, and the distribution of the calculation munities), which resembles (but exceeds) the grouping of re- must be studied. After tackling the research question with lational aggregations. (b) It has a temporal aspect of varying the simplifying assumption of static communities (which are granularity (first, the number of friends is calculated for each highly coupled nodes), then the parallelization of the prob- week; second, only the data of the last year is examined). (c) lem is investigated, and finally, the simplifying assumption It uses different aggregation functions (here: sum, argmax). is removed for the general case of dynamic communities. As (d) It re-uses an aggregation (the sum of the friends has to aggregations play an important role in (temporal) relational be calculated for every node because it is needed to find the data, this direction of research might establish them in tem- most connected member in a first step). poral graphs just as well. 1.1 Motivation The core of this research proposal is to investigate ag- 1. INTRODUCTION gregates on temporal graphs rather than arbitrary temporal Much data in the real world can naturally be expressed graph algorithms. There are several reasons for this focus: as a graph. For example, in a social network, graph nodes can represent people, and graph edges then represent that • There are many use cases of aggregations, among them, two people are friends in this social network. As another for example, weekly analytics of air traffic connections, example, consider a traffic network that consists of streets or averaging the number of contacts between people that are interconnected by crossways. Many of these graphs during a global outbreak of a disease. Furthermore, are not static but change constantly. Social networks, as many use cases of relational temporal aggregates are they are intended for interaction, are obviously dynamic. applicable for graphs, too. The importance of aggre- Likewise, traffic networks are not static as temporary road gates on temporal graph data is also recognized by works or other means of blocked roads are sources of change. their use in benchmarks: The LDBC Social Network For temporal graphs (aka dynamic graphs), the wide range Benchmark uses a temporal graph for its data, and its of use cases results in a plethora of different definitions and business intelligence workload sports several temporal algorithms, as no common view or formalization has – so graph data aggregations. far – been achieved. One important aspect of (not only temporal) graph data is the aggregation of many values to • For several temporal graph problems, there exist ef- ficient, often parallel and/or incremental algorithms. Yet, these algorithms tend to make specific assump- tions on the data model and the workload; when the uses cases are more general or differ slightly in their assumptions, the scalability suffers. Considering the more limited scope, temporal aggregates over graphs are likely to provide more room for optimization and Proceedings of the VLDB 2020 PhD Workshop, August 31st, 2020. Tokyo, Japan. Copyright (C) 2020 for this paper by its authors. Copying permitted less sensitivity to workload and data model changes – for private and academic purposes. we will outline this reasoning later in this paper. • Despite all the points outlined above, aggregates over change over time. A custom data structure called DPS (Dy- temporal graphs have not been a major research direc- namic Property Storage) handles the dynamic attributes of tion so far. This is in stark contrast to classical graph nodes and edges, while the graph itself and the static at- problems such as shortest path or adapting existing tributes of the nodes and edges are kept in the existing Neo4j workloads on temporal data using snapshots. format. Google’s Pregel [20] is an early example of a graph pro- 2. PREVIOUS WORK cessing system, introduced in 2010. It is a distributed sys- tem for big graphs providing a very fine-grained concur- 2.1 Use Cases of Temporal Graphs rency model; like MapReduce [7], it relies on synchronization rounds (BSP). GraphLab [19] positions itself in the machine One of the first examples of using temporal graphs is learning domain, it thus investigates relaxed coordination an article by Cervoni et al. [4]. The authors use temporal models. Kineograph [6] is a distributed graph system for dy- graphs for the calculation of temporal constraints. namic (that means, frequently changing) graphs. It stores Use cases of temporal graph metrics have been mentioned its graph as snapshots for different points in time. Simi- by Tang et al. [24]. Nicosia et al. [21] describe some metrics larly, GraphTau [13] is distributed and stores snapshots; it and differences to their static graph counterparts. builds upon Apache Spark. GraphChi [17] focuses on disk- Holme and Saramäki [11] review many use cases for tem- based evolving graphs. A recent general-purposes system to poral networks and describe the different modeling of them. analyze huge temporal graphs is Gradoop [14]. There are examples of the usage of temporal graphs in Generally speaking, graph databases tend to be limited medicine: Wainer and Sandri [26] use them for medicial di- in scalability for analytical temporal workloads, while the agnostics. Liu et al. [18] model medical events of a single broad support for temporal graph operations in processing patient and their relationship using temporal graphs. system makes it hard to devise optimization specific to ag- For visualizing scientific publications and their interrela- gregations. tions, Erten et al. [8] use temporal graphs. The wide range of uses of temporal graphs (along with our 2.4 Incremental Aggregate Maintenance motivating example) reassures us that the addition of tem- Efficient computation of graph aggregates may draw heav- poral information to graphs is sensible and not an academic ily from existing work on incremental aggregate computa- niche. tion, notably from (a) incremental view maintenance and 2.2 Temporal Databases and Aggregates (b) incremental computation of streaming aggregates. Materialized views are well established in relational data- The research on temporal aggregation dates back to the bases; Halevy [10] gives an extensive overview of how they 1990s, co-occuring to the seminal work on temporal data- are used to answer queries. Materialized views contain the bases. One of the first publications regarding temporal ag- result of a query, which can be, for example, an aggregation. gregation (on relational databases) is by Snodgrass et al. [23], When the underlying data changes, the materialized view which is a direct result of their introduction of TSQL2. is updated. Many approaches exist to perform this in an TSQL2 added temporal semantics on top of the SQL stan- incremental manner. There is an additional interesting line dard, which has become mostly obsolete by the inclusion of work that is helpful for our problem setting: Answering of temporal semantics into SQL:2011. Following, Kline and queries with views and thus view containment could provide Snodgrass further evaluate the calcuation of temporal ag- valuable insights on stacking distinct aggregates – which is gregates in [16]. Zhang et al. [27] appended ranges to the clearly harder than, for example, a cube operator with the temporal aggregates. same aggregation function. Böhlen et al. [2] focus on multi-dimensional temporal ag- Tangwongsan et al. [25] provide a comprehensive solution gregates. They distinguish between two (or more) time di- on sliding-window aggregations. For in-order arrival/expiry mensions: one or more application times, which describe at of the data they achieve O(1), while for out-of-order expiry which time a tuple is valid, and a system time, which defines they still achieve O(1) in the best case (which is in-order) the time where the database system is aware of the tuple. and up to O(log n) in total out-of-order execution. The lat- Cheng [5] describes the aggregation of null-time intervals. ter model matches well with arbitrary lifetime intervals of In 2013, Kaufmann et al. [15] introduce the Timeline In- temporal data. dex, which is a main-memory structure that efficiently sup- ports temporal aggregations on system time and other tem- poral operations in a relational database. Ideas of the Time- 3. PROBLEM DEFINITION line Index will play an important role in our first step of re- The problem we are thus trying to tackle in our work is to search, as we will show in Section 4. Other work mentioned efficiently compute (combinations of) aggregate values over in this section lays the foundation to understand what tem- (possibly changing) substructures of highly dynamic, huge poral aggregations are and what expressive power they have. graphs. The implications of this problem statement can further be 2.3 Processing Systems and Databases broken down along the following dimensions: For our purposes, managing large-scale evolving graphs relates to both graph databases and processing systems: 1. Graph properties to aggregate: Compared to re- Fernandes and Bernardino [9] compare several graph data- lational or streaming/ordered models, graphs allow for base systems. The most popular among them is Neo4j [1]. an additional range of (possibly very costly and hard- TGraph [12] is an ACID-compliant extension of Neo4j for to-optimize) operations such as reachability or shortest temporal range queries. TGraph assumes that the graph paths. These may serve as an input to the aggregate structure changes rarely, while attributes of nodes and edges metrics (such as betweenness centrality), yet the effort to compute them may outstrip the cost of the aggre- the underlying structure of a graphs is one of the most in- gations. teresting challenges when adapting Timeline, as it affects It is therefore an important tradeoff to consider how the scope of aggregations (e.g. over nodes or communities) rich the support for such operations needs to be. Our as well as the means of partitioning the graph for parallel current take is to allow limited neighborhood access execution. To make this challenge more tractable, we first but not arbitrary reach/iteration. keep the aggregation scopes static and drop this require- ment in our last step. Additionally, we consider also ma- 2. Hierarchical composition of distinct aggregate terializing (partial) aggregate values to speed up aggregate functions: While a significant body of works exists on computation over time, graph structure and aggregate func- refining/combining values over the same aggregation tion composition. In turn, this also introduces an additional function (like drill-down or roll-up data cubes), using partitioning problem. several distinct aggregation functions “on top” of each other (e.g. maximum on top of sum in the example) 4.1 Interfacing with Graph Computations is not as well studied for effective evaluation. As a first step, we are investigating how to best express 3. Several dimensions of varying granularity the usage of graph properties required for the metric com- putation. The goal is to define a suitable subset of graph (a) Time: As already observed, temporal aggregates operations that can be computed in an incremental manner may cover varying degrees of time (points, dis- over temporal data and can therefore “drive” an efficient joint/overlapping intervals). In aggregation hier- temporal aggregation process. While some operations are archies, the granularities may necessarily have a obviously both useful for metrics and easy to derive (such containment relationship (e.g. weeks and months). property values on a single node or edge) and others are use- ful, but extremely hard to compute incrementally (general (b) Graph structure: While relational data is typ- temporal paths), the space in between is not well-charted. ically “grouped” over an attribute or combina- We plan to investigate a broad range of metrics from use tions of attributes (which often have an obvious cases in order to further understand the requirements. The containment relationship), graphs provide a wider results will allow use to also adapt the design of the graph range of options, covering individual nodes, neigh- data storage and programming interface. borhoods, communities, connected components or the entire graph. Determining these “groups” may itself be a complex and expensive operation. 4.2 Composition of Aggregates Furthermore, on temporal data the membership A second, but orthogonal problem is the combination of of sets of lower-level groups may change over time, multiple aggregations into a hierarchy. While a wide range such as nodes changing their community, making of single-level aggregations may be supported for incremen- partial computation harder to maintain. tal computation by applying ideas of, e.g., Tangwongsan et al. [25] (maybe with extensions for update sets), re-using 4. Distributed computation: The expected problem partial results over a DAG of distinct aggregation functions size makes single-thread, main-memory approaches as clearly raises its own set of challenges. We plan to inves- well as full recomputation on change not very appeal- tigate expression or view containment approaches to con- ing. While these aspects are fairly well-understood sider both dependencies among aggregation functions and for relational streaming and ongoing work exist for the varying granularities of time and graph structure, pos- static graphs, research for partitioning and distributed sibly deriving “core” aggregation parts in lower levels that state management of dynamic graphs is still in its early can be shared for multiple aggregates or combined among stages. these dimensions. Furthermore, deciding on when and what to materialize will be further area of investigation. To narrow the problem definition, we assume a changing graph whose nodes and edges may have additional proper- 4.3 Parallelization ties assinged that may also change over time, similar to the In order to achieve a significant amount of scalability, the approach Huang et al. have for TGraph [12]. To give more computations need to be parallelized. expressive power, we assume an interval time model instead We see two main directions: (a) Partitioning (social) graph of only allowing points in time for queries. At this time, we data in order to both maximize parallel computation and do not assume a graph with directed nor undirected edges. minimize communication is an ongoing challenge in the re- search community, e.g., due to skewed distribution of the 4. RESEARCH PLAN data. We expect that some parts of the changelog, the snap- An interesting foundation for this type of research is the shots, and the partial aggregates may just cover a small sub- Timeline Index approach of Kaufmann et al. [15]. It ex- set of highly active and highly connected nodes, while other presses the (relational) temporal data as a changelog of acti- parts may cover larger sets of lower activity. We also may vated/deactived data points. Validity information snapshots have to investigate possible “cuts” within graph structures reduce the space to be processed and allow for archival or such as communities or even nodes, so partial aggregates distribution. Due to these design choices and also due to can be computed with more parallelism. the optimization for main memory, it provides an efficient (b) In contrast to general graph computations, many ag- underpinning for incremental aggregations. gregate functions do not require strict consistency rules in Temporal graphs can clearly also be expressed as a (set) order to produce correct results, similar to what CRDTs [22] of changelog(s) of nodes, edges and properties. Dealing with can achieve in eventual consistency environements. The temporal validity information provided with the data ele- [8] C. Erten et al. Exploring the computing literature ments facilitates the reconciliation of different “episodes” using temporal graph visualization. In Visualization and may further be used to derive the synchronization bound- and Data Analysis 2004, volume 5295, pages 45–56. aries as only changes in the data require actual coordination. International Society for Optics and Photonics, 2004. [9] D. Fernandes and J. Bernardino. Graph databases 4.4 Dynamic Communities comparison: Allegrograph, arangodb, infinitegraph, In our last step, we drop the assumption of static graph neo4j, and orientdb. In DATA, pages 373–380, 2018. structures, so that the graph structures we aggregate over [10] A. Y. Halevy. Answering queries using views: A also change over time; hence, nodes and edges may migrate survey. The VLDB Journal, 10(4):270–294, 2001. between substructures. This has two major consequences: [11] P. Holme and J. Saramäki. Temporal networks. (a) Determining communities is by itself an expensive oper- Physics reports, 519(3):97–125, 2012. ation, even on non-temporal graphs, typically scaling much [12] H. Huang et al. Tgraph: A temporal graph data faster than the number of edges. We already begun investi- management system. In CIKM, pages 2469–2472. gating some light-weight methods on the basis of Brandes et ACM, 2016. al. [3] that promise to allow for incremental community de- [13] A. P. Iyer et al. Time-evolving graph processing at tection and evolution. (b) Sharing partial changelogs, snap- scale. In GRADES 2016, pages 1–6, 2016. shots or aggregates becomes more challenging when their [14] M. Junghanns et al. Gradoop: Scalable graph data membership or association to bigger structures changes. We management and analytics with hadoop. arXiv plan to investigate methods for more fine-grained partition- preprint arXiv:1506.00548, 2015. ing or association-strength partitioning strategies to mini- [15] M. Kaufmann et al. Timeline index: a unified data mize the number of large-scale recomputation. structure for processing queries on temporal data in sap hana. In SIGMOD, pages 1173–1184. ACM, 2013. 5. CONCLUSIONS [16] N. Kline and R. T. Snodgrass. Computing temporal Aggregations have important applications for relational aggregates. In ICDE, pages 222–231. IEEE, 1995. data, and they are used frequently. Likewise, there are many [17] A. Kyrola, G. Blelloch, and C. Guestrin. Graphchi: use cases for aggregations of temporal graph data. There is Large-scale graph computation on just a PC. In OSDI a broad foundation of graph databases and research of other 2012, pages 31–46, 2012. important temporal graph problems. Also, there are simi- [18] C. Liu, F. Wang, J. Hu, and H. Xiong. Temporal larities in the challenges of computation of streaming data phenotyping from longitudinal electronic health and the maintenance of materialized views, for example. records: A graph based framework. In ACM SIGKDD The research problem is four-fold: Graphs have differ- 2015, pages 705–714. ACM, 2015. ent (sometimes very costly) operations; it is necessary to [19] Y. Low et al. Distributed graphlab: A framework for find a set of operations that is both efficient and expressive. machine learning in the cloud. PVLDB, 5(8):716–727, Secondly, because some aggregations can be composed into 2012. others, it is wise to study how this can be done efficiently. [20] G. Malewicz et al. Pregel: a system for large-scale Thirdly, the time dimension and the graph structure give graph processing. In SIGMOD, pages 135–146, 2010. granularity. Lastly, the problem should be solved distribu- [21] V. Nicosia et al. Graph metrics for temporal networks. tive. In Temporal networks, pages 15–40. Springer, 2013. [22] M. Shapiro et al. Conflict-free replicated data types. 6. REFERENCES In Stabilization, Safety, and Security of Distributed [1] Neo4j. https://neo4j.com. Systems, pages 386–400. Springer, 2011. [2] M. Böhlen et al. Multi-dimensional aggregation for [23] R. T. Snodgrass, S. Gomez, and L. E. McKenzie. temporal data. In EDBT, pages 257–275, 2006. Aggregates in the temporal query language tquel. [3] U. Brandes et al. On modularity clustering. TKDE IEEE Transactions on Knowledge and Data 2007, 20(2):172–188, 2007. Engineering, 5(5):826–842, 1993. [4] R. Cervoni, A. Cesta, and A. Oddi. Managing [24] J. Tang et al. Applications of temporal graph metrics dynamic temporal constraint networks. In AIPS, to real-world networks. In Temporal Networks, pages pages 13–18, 1994. 135–159. Springer, 2013. [5] K. Cheng. On computing temporal aggregates over [25] K. Tangwongsan, M. Hirzel, and S. Schneider. null time intervals. In D. Benslimane et al., editors, Optimal and general out-of-order sliding-window DEXA, pages 67–79, Cham, 2017. Springer. aggregation. Proceedings of the VLDB Endowment, [6] R. Cheng et al. Kineograph: taking the pulse of a 12(10):1167–1180, 2019. fast-changing and connected world. In EuroSys 2012, [26] J. Wainer and S. Sandri. Fuzzy temporal/categorical pages 85–98, 2012. information in diagnosis. Journal of Intelligent [7] J. Dean and S. Ghemawat. Mapreduce: simplified Information Systems, 13(1-2):9–26, 1999. data processing on large clusters. CACM, [27] D. Zhang et al. Efficient computation of temporal 51(1):107–113, 2008. aggregates with range predicates. In PODS 2001, pages 237–245, 2001.