=Paper= {{Paper |id=Vol-2652/paper11 |storemode=property |title=Efficient and Scalable Aggregate Computation on Temporal Graphs |pdfUrl=https://ceur-ws.org/Vol-2652/paper11.pdf |volume=Vol-2652 |authors=Vincent Le Claire |dblpUrl=https://dblp.org/rec/conf/vldb/Claire20 }} ==Efficient and Scalable Aggregate Computation on Temporal Graphs== https://ceur-ws.org/Vol-2652/paper11.pdf
            Efficient and Scalable Aggregate Computation on
                            Temporal Graphs

                                                               Vincent Le Claire
                                                    supervised by Prof. Dr. Peter Fischer
                                                           University of Augsburg
                                                            Augsburg, Germany
                                         vincent.leclaire@informatik.uni-augsburg.de

ABSTRACT                                                                     a smaller set of meaningful metrics, as often seen in, e.g.,
Many applications such as social networks generate big vol-                  business intelligence.
umes of graph data that has additional temporal information                     In the context of aggregates in temporal graphs, consider
or changes rapidly, which has lead to a significant amount of                the following example: In a social network, users can be
research on temporal graphs. Existing and ongoing work on                    grouped into several communities that have stronger bounds
temporal graphs has focused on path problems and graph                       inside than to the outside. A query for this use case might
databases in general. Aggregations, which are very common                    be: “For the most connected member of each community,
for relational data, are just as insightful for temporal graph               how much did the number of friends change over the progress
data, but need to be computed efficiently and scalable. To                   of every calendar week of last year?” That query has sev-
aggreagte efficiently, useful operations on graphs need to                   eral complementary aspects: (a) It accesses the graph in
be selected, the composition of aggregation functions needs                  substructures of varying amount (nodes, neighbours, com-
to be investigated, and the distribution of the calculation                  munities), which resembles (but exceeds) the grouping of re-
must be studied. After tackling the research question with                   lational aggregations. (b) It has a temporal aspect of varying
the simplifying assumption of static communities (which are                  granularity (first, the number of friends is calculated for each
highly coupled nodes), then the parallelization of the prob-                 week; second, only the data of the last year is examined). (c)
lem is investigated, and finally, the simplifying assumption                 It uses different aggregation functions (here: sum, argmax).
is removed for the general case of dynamic communities. As                   (d) It re-uses an aggregation (the sum of the friends has to
aggregations play an important role in (temporal) relational                 be calculated for every node because it is needed to find the
data, this direction of research might establish them in tem-                most connected member in a first step).
poral graphs just as well.
                                                                             1.1    Motivation
                                                                               The core of this research proposal is to investigate ag-
1.    INTRODUCTION                                                           gregates on temporal graphs rather than arbitrary temporal
   Much data in the real world can naturally be expressed                    graph algorithms. There are several reasons for this focus:
as a graph. For example, in a social network, graph nodes
can represent people, and graph edges then represent that
                                                                                • There are many use cases of aggregations, among them,
two people are friends in this social network. As another
                                                                                  for example, weekly analytics of air traffic connections,
example, consider a traffic network that consists of streets
                                                                                  or averaging the number of contacts between people
that are interconnected by crossways. Many of these graphs
                                                                                  during a global outbreak of a disease. Furthermore,
are not static but change constantly. Social networks, as
                                                                                  many use cases of relational temporal aggregates are
they are intended for interaction, are obviously dynamic.
                                                                                  applicable for graphs, too. The importance of aggre-
Likewise, traffic networks are not static as temporary road
                                                                                  gates on temporal graph data is also recognized by
works or other means of blocked roads are sources of change.
                                                                                  their use in benchmarks: The LDBC Social Network
For temporal graphs (aka dynamic graphs), the wide range
                                                                                  Benchmark uses a temporal graph for its data, and its
of use cases results in a plethora of different definitions and
                                                                                  business intelligence workload sports several temporal
algorithms, as no common view or formalization has – so
                                                                                  graph data aggregations.
far – been achieved. One important aspect of (not only
temporal) graph data is the aggregation of many values to
                                                                                • For several temporal graph problems, there exist ef-
                                                                                  ficient, often parallel and/or incremental algorithms.
                                                                                  Yet, these algorithms tend to make specific assump-
                                                                                  tions on the data model and the workload; when the
                                                                                  uses cases are more general or differ slightly in their
                                                                                  assumptions, the scalability suffers. Considering the
                                                                                  more limited scope, temporal aggregates over graphs
                                                                                  are likely to provide more room for optimization and
Proceedings of the VLDB 2020 PhD Workshop, August 31st, 2020. Tokyo,
Japan. Copyright (C) 2020 for this paper by its authors. Copying permitted        less sensitivity to workload and data model changes –
for private and academic purposes.                                                we will outline this reasoning later in this paper.
     • Despite all the points outlined above, aggregates over      change over time. A custom data structure called DPS (Dy-
       temporal graphs have not been a major research direc-       namic Property Storage) handles the dynamic attributes of
       tion so far. This is in stark contrast to classical graph   nodes and edges, while the graph itself and the static at-
       problems such as shortest path or adapting existing         tributes of the nodes and edges are kept in the existing Neo4j
       workloads on temporal data using snapshots.                 format.
                                                                      Google’s Pregel [20] is an early example of a graph pro-
2.    PREVIOUS WORK                                                cessing system, introduced in 2010. It is a distributed sys-
                                                                   tem for big graphs providing a very fine-grained concur-
2.1     Use Cases of Temporal Graphs                               rency model; like MapReduce [7], it relies on synchronization
                                                                   rounds (BSP). GraphLab [19] positions itself in the machine
   One of the first examples of using temporal graphs is
                                                                   learning domain, it thus investigates relaxed coordination
an article by Cervoni et al. [4]. The authors use temporal
                                                                   models. Kineograph [6] is a distributed graph system for dy-
graphs for the calculation of temporal constraints.
                                                                   namic (that means, frequently changing) graphs. It stores
   Use cases of temporal graph metrics have been mentioned
                                                                   its graph as snapshots for different points in time. Simi-
by Tang et al. [24]. Nicosia et al. [21] describe some metrics
                                                                   larly, GraphTau [13] is distributed and stores snapshots; it
and differences to their static graph counterparts.
                                                                   builds upon Apache Spark. GraphChi [17] focuses on disk-
   Holme and Saramäki [11] review many use cases for tem-
                                                                   based evolving graphs. A recent general-purposes system to
poral networks and describe the different modeling of them.
                                                                   analyze huge temporal graphs is Gradoop [14].
   There are examples of the usage of temporal graphs in
                                                                      Generally speaking, graph databases tend to be limited
medicine: Wainer and Sandri [26] use them for medicial di-
                                                                   in scalability for analytical temporal workloads, while the
agnostics. Liu et al. [18] model medical events of a single
                                                                   broad support for temporal graph operations in processing
patient and their relationship using temporal graphs.
                                                                   system makes it hard to devise optimization specific to ag-
   For visualizing scientific publications and their interrela-
                                                                   gregations.
tions, Erten et al. [8] use temporal graphs.
   The wide range of uses of temporal graphs (along with our       2.4      Incremental Aggregate Maintenance
motivating example) reassures us that the addition of tem-
                                                                      Efficient computation of graph aggregates may draw heav-
poral information to graphs is sensible and not an academic
                                                                   ily from existing work on incremental aggregate computa-
niche.
                                                                   tion, notably from (a) incremental view maintenance and
2.2     Temporal Databases and Aggregates                          (b) incremental computation of streaming aggregates.
                                                                      Materialized views are well established in relational data-
   The research on temporal aggregation dates back to the
                                                                   bases; Halevy [10] gives an extensive overview of how they
1990s, co-occuring to the seminal work on temporal data-
                                                                   are used to answer queries. Materialized views contain the
bases. One of the first publications regarding temporal ag-
                                                                   result of a query, which can be, for example, an aggregation.
gregation (on relational databases) is by Snodgrass et al. [23],
                                                                   When the underlying data changes, the materialized view
which is a direct result of their introduction of TSQL2.
                                                                   is updated. Many approaches exist to perform this in an
TSQL2 added temporal semantics on top of the SQL stan-
                                                                   incremental manner. There is an additional interesting line
dard, which has become mostly obsolete by the inclusion
                                                                   of work that is helpful for our problem setting: Answering
of temporal semantics into SQL:2011. Following, Kline and
                                                                   queries with views and thus view containment could provide
Snodgrass further evaluate the calcuation of temporal ag-
                                                                   valuable insights on stacking distinct aggregates – which is
gregates in [16]. Zhang et al. [27] appended ranges to the
                                                                   clearly harder than, for example, a cube operator with the
temporal aggregates.
                                                                   same aggregation function.
   Böhlen et al. [2] focus on multi-dimensional temporal ag-
                                                                      Tangwongsan et al. [25] provide a comprehensive solution
gregates. They distinguish between two (or more) time di-
                                                                   on sliding-window aggregations. For in-order arrival/expiry
mensions: one or more application times, which describe at
                                                                   of the data they achieve O(1), while for out-of-order expiry
which time a tuple is valid, and a system time, which defines
                                                                   they still achieve O(1) in the best case (which is in-order)
the time where the database system is aware of the tuple.
                                                                   and up to O(log n) in total out-of-order execution. The lat-
Cheng [5] describes the aggregation of null-time intervals.
                                                                   ter model matches well with arbitrary lifetime intervals of
   In 2013, Kaufmann et al. [15] introduce the Timeline In-
                                                                   temporal data.
dex, which is a main-memory structure that efficiently sup-
ports temporal aggregations on system time and other tem-
poral operations in a relational database. Ideas of the Time-      3.     PROBLEM DEFINITION
line Index will play an important role in our first step of re-       The problem we are thus trying to tackle in our work is to
search, as we will show in Section 4. Other work mentioned         efficiently compute (combinations of) aggregate values over
in this section lays the foundation to understand what tem-        (possibly changing) substructures of highly dynamic, huge
poral aggregations are and what expressive power they have.        graphs.
                                                                      The implications of this problem statement can further be
2.3     Processing Systems and Databases                           broken down along the following dimensions:
  For our purposes, managing large-scale evolving graphs
relates to both graph databases and processing systems:                 1. Graph properties to aggregate: Compared to re-
  Fernandes and Bernardino [9] compare several graph data-                 lational or streaming/ordered models, graphs allow for
base systems. The most popular among them is Neo4j [1].                    an additional range of (possibly very costly and hard-
TGraph [12] is an ACID-compliant extension of Neo4j for                    to-optimize) operations such as reachability or shortest
temporal range queries. TGraph assumes that the graph                      paths. These may serve as an input to the aggregate
structure changes rarely, while attributes of nodes and edges              metrics (such as betweenness centrality), yet the effort
        to compute them may outstrip the cost of the aggre-        the underlying structure of a graphs is one of the most in-
        gations.                                                   teresting challenges when adapting Timeline, as it affects
        It is therefore an important tradeoff to consider how      the scope of aggregations (e.g. over nodes or communities)
        rich the support for such operations needs to be. Our      as well as the means of partitioning the graph for parallel
        current take is to allow limited neighborhood access       execution. To make this challenge more tractable, we first
        but not arbitrary reach/iteration.                         keep the aggregation scopes static and drop this require-
                                                                   ment in our last step. Additionally, we consider also ma-
     2. Hierarchical composition of distinct aggregate             terializing (partial) aggregate values to speed up aggregate
        functions: While a significant body of works exists on     computation over time, graph structure and aggregate func-
        refining/combining values over the same aggregation        tion composition. In turn, this also introduces an additional
        function (like drill-down or roll-up data cubes), using    partitioning problem.
        several distinct aggregation functions “on top” of each
        other (e.g. maximum on top of sum in the example)          4.1   Interfacing with Graph Computations
        is not as well studied for effective evaluation.             As a first step, we are investigating how to best express
     3. Several dimensions of varying granularity                  the usage of graph properties required for the metric com-
                                                                   putation. The goal is to define a suitable subset of graph
        (a) Time: As already observed, temporal aggregates         operations that can be computed in an incremental manner
            may cover varying degrees of time (points, dis-        over temporal data and can therefore “drive” an efficient
            joint/overlapping intervals). In aggregation hier-     temporal aggregation process. While some operations are
            archies, the granularities may necessarily have a      obviously both useful for metrics and easy to derive (such
            containment relationship (e.g. weeks and months).      property values on a single node or edge) and others are use-
                                                                   ful, but extremely hard to compute incrementally (general
        (b) Graph structure: While relational data is typ-         temporal paths), the space in between is not well-charted.
            ically “grouped” over an attribute or combina-         We plan to investigate a broad range of metrics from use
            tions of attributes (which often have an obvious       cases in order to further understand the requirements. The
            containment relationship), graphs provide a wider      results will allow use to also adapt the design of the graph
            range of options, covering individual nodes, neigh-    data storage and programming interface.
            borhoods, communities, connected components or
            the entire graph. Determining these “groups”
            may itself be a complex and expensive operation.       4.2   Composition of Aggregates
            Furthermore, on temporal data the membership              A second, but orthogonal problem is the combination of
            of sets of lower-level groups may change over time,    multiple aggregations into a hierarchy. While a wide range
            such as nodes changing their community, making         of single-level aggregations may be supported for incremen-
            partial computation harder to maintain.                tal computation by applying ideas of, e.g., Tangwongsan et
                                                                   al. [25] (maybe with extensions for update sets), re-using
     4. Distributed computation: The expected problem              partial results over a DAG of distinct aggregation functions
        size makes single-thread, main-memory approaches as        clearly raises its own set of challenges. We plan to inves-
        well as full recomputation on change not very appeal-      tigate expression or view containment approaches to con-
        ing. While these aspects are fairly well-understood        sider both dependencies among aggregation functions and
        for relational streaming and ongoing work exist for        the varying granularities of time and graph structure, pos-
        static graphs, research for partitioning and distributed   sibly deriving “core” aggregation parts in lower levels that
        state management of dynamic graphs is still in its early   can be shared for multiple aggregates or combined among
        stages.                                                    these dimensions. Furthermore, deciding on when and what
                                                                   to materialize will be further area of investigation.
   To narrow the problem definition, we assume a changing
graph whose nodes and edges may have additional proper-            4.3   Parallelization
ties assinged that may also change over time, similar to the
                                                                     In order to achieve a significant amount of scalability, the
approach Huang et al. have for TGraph [12]. To give more
                                                                   computations need to be parallelized.
expressive power, we assume an interval time model instead
                                                                     We see two main directions: (a) Partitioning (social) graph
of only allowing points in time for queries. At this time, we
                                                                   data in order to both maximize parallel computation and
do not assume a graph with directed nor undirected edges.
                                                                   minimize communication is an ongoing challenge in the re-
                                                                   search community, e.g., due to skewed distribution of the
4.     RESEARCH PLAN                                               data. We expect that some parts of the changelog, the snap-
   An interesting foundation for this type of research is the      shots, and the partial aggregates may just cover a small sub-
Timeline Index approach of Kaufmann et al. [15]. It ex-            set of highly active and highly connected nodes, while other
presses the (relational) temporal data as a changelog of acti-     parts may cover larger sets of lower activity. We also may
vated/deactived data points. Validity information snapshots        have to investigate possible “cuts” within graph structures
reduce the space to be processed and allow for archival or         such as communities or even nodes, so partial aggregates
distribution. Due to these design choices and also due to          can be computed with more parallelism.
the optimization for main memory, it provides an efficient           (b) In contrast to general graph computations, many ag-
underpinning for incremental aggregations.                         gregate functions do not require strict consistency rules in
   Temporal graphs can clearly also be expressed as a (set)        order to produce correct results, similar to what CRDTs [22]
of changelog(s) of nodes, edges and properties. Dealing with       can achieve in eventual consistency environements. The
temporal validity information provided with the data ele-          [8] C. Erten et al. Exploring the computing literature
ments facilitates the reconciliation of different “episodes”           using temporal graph visualization. In Visualization
and may further be used to derive the synchronization bound-           and Data Analysis 2004, volume 5295, pages 45–56.
aries as only changes in the data require actual coordination.         International Society for Optics and Photonics, 2004.
                                                                   [9] D. Fernandes and J. Bernardino. Graph databases
4.4    Dynamic Communities                                             comparison: Allegrograph, arangodb, infinitegraph,
   In our last step, we drop the assumption of static graph            neo4j, and orientdb. In DATA, pages 373–380, 2018.
structures, so that the graph structures we aggregate over        [10] A. Y. Halevy. Answering queries using views: A
also change over time; hence, nodes and edges may migrate              survey. The VLDB Journal, 10(4):270–294, 2001.
between substructures. This has two major consequences:           [11] P. Holme and J. Saramäki. Temporal networks.
(a) Determining communities is by itself an expensive oper-            Physics reports, 519(3):97–125, 2012.
ation, even on non-temporal graphs, typically scaling much        [12] H. Huang et al. Tgraph: A temporal graph data
faster than the number of edges. We already begun investi-             management system. In CIKM, pages 2469–2472.
gating some light-weight methods on the basis of Brandes et            ACM, 2016.
al. [3] that promise to allow for incremental community de-       [13] A. P. Iyer et al. Time-evolving graph processing at
tection and evolution. (b) Sharing partial changelogs, snap-           scale. In GRADES 2016, pages 1–6, 2016.
shots or aggregates becomes more challenging when their           [14] M. Junghanns et al. Gradoop: Scalable graph data
membership or association to bigger structures changes. We             management and analytics with hadoop. arXiv
plan to investigate methods for more fine-grained partition-           preprint arXiv:1506.00548, 2015.
ing or association-strength partitioning strategies to mini-
                                                                  [15] M. Kaufmann et al. Timeline index: a unified data
mize the number of large-scale recomputation.
                                                                       structure for processing queries on temporal data in
                                                                       sap hana. In SIGMOD, pages 1173–1184. ACM, 2013.
5.    CONCLUSIONS                                                 [16] N. Kline and R. T. Snodgrass. Computing temporal
   Aggregations have important applications for relational             aggregates. In ICDE, pages 222–231. IEEE, 1995.
data, and they are used frequently. Likewise, there are many      [17] A. Kyrola, G. Blelloch, and C. Guestrin. Graphchi:
use cases for aggregations of temporal graph data. There is            Large-scale graph computation on just a PC. In OSDI
a broad foundation of graph databases and research of other            2012, pages 31–46, 2012.
important temporal graph problems. Also, there are simi-          [18] C. Liu, F. Wang, J. Hu, and H. Xiong. Temporal
larities in the challenges of computation of streaming data            phenotyping from longitudinal electronic health
and the maintenance of materialized views, for example.                records: A graph based framework. In ACM SIGKDD
   The research problem is four-fold: Graphs have differ-              2015, pages 705–714. ACM, 2015.
ent (sometimes very costly) operations; it is necessary to        [19] Y. Low et al. Distributed graphlab: A framework for
find a set of operations that is both efficient and expressive.        machine learning in the cloud. PVLDB, 5(8):716–727,
Secondly, because some aggregations can be composed into               2012.
others, it is wise to study how this can be done efficiently.
                                                                  [20] G. Malewicz et al. Pregel: a system for large-scale
Thirdly, the time dimension and the graph structure give
                                                                       graph processing. In SIGMOD, pages 135–146, 2010.
granularity. Lastly, the problem should be solved distribu-
                                                                  [21] V. Nicosia et al. Graph metrics for temporal networks.
tive.
                                                                       In Temporal networks, pages 15–40. Springer, 2013.
                                                                  [22] M. Shapiro et al. Conflict-free replicated data types.
6.    REFERENCES                                                       In Stabilization, Safety, and Security of Distributed
 [1] Neo4j. https://neo4j.com.                                         Systems, pages 386–400. Springer, 2011.
 [2] M. Böhlen et al. Multi-dimensional aggregation for          [23] R. T. Snodgrass, S. Gomez, and L. E. McKenzie.
     temporal data. In EDBT, pages 257–275, 2006.                      Aggregates in the temporal query language tquel.
 [3] U. Brandes et al. On modularity clustering. TKDE                  IEEE Transactions on Knowledge and Data
     2007, 20(2):172–188, 2007.                                        Engineering, 5(5):826–842, 1993.
 [4] R. Cervoni, A. Cesta, and A. Oddi. Managing                  [24] J. Tang et al. Applications of temporal graph metrics
     dynamic temporal constraint networks. In AIPS,                    to real-world networks. In Temporal Networks, pages
     pages 13–18, 1994.                                                135–159. Springer, 2013.
 [5] K. Cheng. On computing temporal aggregates over              [25] K. Tangwongsan, M. Hirzel, and S. Schneider.
     null time intervals. In D. Benslimane et al., editors,            Optimal and general out-of-order sliding-window
     DEXA, pages 67–79, Cham, 2017. Springer.                          aggregation. Proceedings of the VLDB Endowment,
 [6] R. Cheng et al. Kineograph: taking the pulse of a                 12(10):1167–1180, 2019.
     fast-changing and connected world. In EuroSys 2012,          [26] J. Wainer and S. Sandri. Fuzzy temporal/categorical
     pages 85–98, 2012.                                                information in diagnosis. Journal of Intelligent
 [7] J. Dean and S. Ghemawat. Mapreduce: simplified                    Information Systems, 13(1-2):9–26, 1999.
     data processing on large clusters. CACM,                     [27] D. Zhang et al. Efficient computation of temporal
     51(1):107–113, 2008.                                              aggregates with range predicates. In PODS 2001,
                                                                       pages 237–245, 2001.