               Tunable Streaming Graph Embeddings at Scale

                                                              Serafeim Papadias
                                                         Technische Universität Berlin
                                                       supervised by Prof. Volker Markl
                                                         s.papadias@tu-berlin.de



ABSTRACT

An increasing number of real-world applications require machine learning tasks over large-scale streaming graphs, where nodes and edges are continuously added or deleted. Graph embeddings have been widely used for solving such tasks by capturing the graph structure and features in a low-dimensional latent space. However, current approaches have one or more of the following disadvantages: (i) they are designed for static or dynamic graphs and thus need retraining after each graph change or periodic embedding updates after each snapshot arrival; (ii) they fail to scale to today's graphs composed of billions of nodes; or (iii) those devised for streaming graphs perform redundant retraining computations by mandating continuous embedding updates even when accuracy does not improve. The goal of this thesis is to overcome the above-mentioned problems by devising tunable streaming methods that can scale to massive graphs. We envision an end-to-end streaming ML system that achieves this goal and provides users with abstractions to easily define their own streaming embedding algorithms.

1. INTRODUCTION
Graphs are omnipresent in various domains such as social media, transportation, finance, IoT, and biological networks. Typically, real-world graphs are inherently dynamic, entailing continuous additions and deletions of vertices and edges. The frequency of these graph updates lies on a spectrum. In dynamic graphs, updates appear in batches as graph snapshots and are applied periodically. In streaming graphs, updates arrive spontaneously and are incorporated on-the-fly. For instance, social networks that model friendships between users are highly dynamic, while citation networks modeling relations among scholars are less so. Many real-world applications, such as news recommendation or crime detection, can be modeled as machine learning (ML) tasks over streaming graphs. Characteristic tasks include vertex classification, link prediction, link reconstruction, topic modeling, and community, anomaly, fraud, and outlier detection.

A very popular and effective technique for solving such ML tasks is graph node embeddings (a.k.a. network representation learning). Given a graph G = (V, E) with n nodes, a graph embedding maps each node v ∈ V to a compact feature vector in a low k-dimensional space (k ≪ n) that captures the graph structure and properties in the vicinity of v. These vectors are derived by optimizing objective functions that preserve geometric relationships among graph nodes. Computing embeddings is a crucial problem by itself, as they serve as inputs to the downstream ML tasks mentioned above.
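To make the definition concrete, here is a minimal sketch (the sizes and names are ours, purely illustrative): the embedding is a matrix with one k-dimensional row per node, and a downstream task such as link prediction can score a candidate edge by a simple dot product.

```python
import numpy as np

# A graph embedding is just a map V -> R^k; row v of `emb` is node v's vector.
n, k = 10_000, 128                      # illustrative sizes, k << n
rng = np.random.default_rng(0)
emb = rng.normal(scale=0.1, size=(n, k)).astype(np.float32)

def link_score(u: int, v: int) -> float:
    """Downstream use, e.g., link prediction: score a candidate edge (u, v)
    by the similarity of the two endpoint vectors."""
    return float(emb[u] @ emb[v])

print(link_score(0, 1))
```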
The frequency of updating the embedding vectors plays a significant role in runtime performance and accuracy, and should be driven by the subsequent ML task. For example, critical downstream applications, such as anomaly detection, should instantly react to graph changes in order to capture all anomalies; hence, their input embeddings must constantly remain up-to-date to produce accurate results. For this scenario, static embedding methods, such as [9], are unsuitable, as they need to retrain the embeddings from scratch after each graph change. Dynamic techniques, such as [6], are not sufficient either, as they update the embeddings periodically after each snapshot arrival and hence fall short of discovering anomalies that appear during the idle period between two graph snapshots. Even though streaming algorithms [7, 10] refine the embedding vectors after every graph update, this can be unnecessary, as the graph structure may not be altered substantially enough to affect the accuracy of the downstream ML task. Thus, anomaly detection and similar continuous ML tasks dictate flexible streaming solutions that are tunable: they can adjust the frequency with which the embeddings are updated.

Nowadays, real-world graphs not only dictate (tunable) streaming algorithms due to their ever-changing nature, but also scalable solutions because of their massive size. However, the majority of existing embedding techniques, whether designed for static graphs [3, 9, 11, 12] or for dynamic graphs [6, 7, 10], are centralized; hence, they do not scale to massive graphs. Ideally, scalability should be achieved by distributing the processing over clusters of commodity machines, thus avoiding solutions that rely on expensive server machines with several terabytes of main memory or on unaffordable GPUs. Few scalable graph embedding systems [5] exist; however, they are unable to operate on evolving graphs. Thus, there is a lack of solutions that can both achieve scalability and conduct streaming graph embedding processing.

As ML, and graph embeddings in particular, become more and more popular, there is a need for systems that facilitate the development of such algorithms by hiding the system complexity from users. For instance, the way the system distributes the processing of the defined algorithms should be transparent to users. A few systems pursue this goal: they provide primitives for distributed random walk computation on static graphs [13] or for graph neural network computation [14, 15]. However, no such comprehensive system for (tunable) streaming graph embeddings exists.
In this thesis, we strive to: (i) devise tunable streaming methods that generate embeddings incrementally by adjusting the frequency of vector updates, (ii) build a scalable solution that processes streaming embedding algorithms in a distributed manner, and (iii) provide abstractions that ideally incorporate distinct classes of streaming embedding methods. Our goal is to synthesize these solutions into Kaixis¹, a novel end-to-end system capable of bridging the gap between streaming and distributed embedding methods.

¹From the Turkish word Kayikçi; the owner of a fishing boat.
2. RELATED WORK

Representation learning on graphs has received considerable attention from researchers in the past few years. There are three main categories of graph embeddings, based on: (a) random walks, (b) matrix factorization, and (c) deep learning. In what follows, we focus on the first category, as the other two incur extremely high costs that render them unsuitable for streaming scenarios, i.e., deep learning-based methods train directly on the whole graph, and factorization-based ones suffer from expensive matrix operations [6]. We mainly review random walk-based (dynamic) network embedding algorithms and systems.
Walk-based Algorithms. Embedding methods based on random walks mainly consist of two phases: the random walking, which explores, samples, and captures certain properties of the graph, and the training, which subsequently ingests the produced random walks and trains the embedding vectors, e.g., using the well-known Skip-Gram [8] model. Depending on the random walk type, different properties are captured. Specifically, DeepWalk [9] performs truncated random walks to preserve first-order proximities in a graph, whereas node2vec [3] deploys biased second-order walks capturing second-order proximities. LINE [11] optimizes an objective function that preserves both first-order and second-order proximities while deriving the embedding vectors. GraRep [1] captures higher-order proximities in the final embedding vectors. GraphCSC [2] deploys centrality-based walks capable of learning embeddings that preserve graph characteristics, such as degree and betweenness, and finally aggregates them into one vector. Nevertheless, all the techniques above are designed for static graphs; hence, they are unable to adapt to the streaming scenarios we focus on.
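The two-phase pipeline is easy to see in code. Below is a minimal DeepWalk-style sketch (the tiny graph and hyper-parameters are illustrative, and gensim's Word2Vec stands in as one possible Skip-Gram trainer):

```python
import random
from gensim.models import Word2Vec  # one possible Skip-Gram implementation

def truncated_walks(adj, num_walks=10, walk_len=40, seed=0):
    """Phase 1: sample fixed-length first-order random walks from every node."""
    rnd = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for v in adj:
            walk, cur = [v], v
            for _ in range(walk_len - 1):
                if not adj[cur]:
                    break
                cur = rnd.choice(adj[cur])
                walk.append(cur)
            walks.append([str(u) for u in walk])  # node ids become "words"
    return walks

# Phase 2: treat the walks as sentences and train Skip-Gram on them.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
model = Word2Vec(sentences=truncated_walks(adj), vector_size=64,
                 window=5, sg=1, negative=5, min_count=0, epochs=5)
vector_of_node_0 = model.wv["0"]
```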
Interestingly enough, only [7, 10] address the problem of streaming graph embeddings. In [7], a rather ad-hoc influence propagation model is used for locating the nodes whose embedding vector is influenced by a graph change (either an addition or a deletion). Also, a vertex stream is assumed, i.e., each new node arrives along with its complete adjacency list, which is a limitation. In [10], a method that incrementally maintains first- and second-order random walks in a streaming graph is proposed; however, it is centralized and lacks theoretical guarantees of correctness.

Walk-based embedding algorithms fall short mainly in two crucial aspects. First, they incur computational costs proportional to the number of deployed walks, rendering them insufficient for massive graphs, especially the widespread centralized solutions. Instead, massively-parallel and ideally cost-effective solutions, such as distributing tasks over clusters of commodity machines, are required. Second, there is no well-established streaming walk-based technique with a robust theoretical analysis. In contrast, we aim for (tunable) streaming and distributed walk-based embeddings that are also theoretically established.

Walk-based Systems. KnightKing [13] is a distributed system dedicated to computing random walks on static graphs. It offers a walker-centric computation model that is able to express various walk algorithms. Its efficiency is largely attributed to the rejection sampling it utilizes, especially when computing cumbersome higher-order walks. However, KnightKing is incapable of computing streaming random walks, as it requires the whole graph in advance. Besides academia, industry has also shown great interest in scalable systems for graph neural networks (GNNs) [14, 15], which are relevant to graph embeddings. AliGraph [15] provides primitive operators that abstract common concepts among GNN algorithms, facilitating users in implementing and deploying them. AGL [14] builds upon k-hop neighborhoods² and utilizes MapReduce and Parameter Servers to speed up GNN training, a relatively straightforward task due to the linear nature of GNNs. As far as we know, there is no comprehensive system that provides primitives for streaming embedding algorithms.

²For vertex v, the set of vertices reachable from v within at most k steps.
[Figure 1: System architecture of Kaixis.]

3. KAIXIS ARCHITECTURE

In this section we introduce Kaixis, our envisioned end-to-end system for computing streaming random walk-based graph embeddings on the fly and at scale. Figure 1 shows the system architecture, consisting of seven main components: a Graph ML Library, the Embedding Primitives, a Stateful Walker, an Embedding Trainer, a Task Trainer, a Training Monitor, and the Results & QoS module.

The input of the system consists of continuous ML tasks, such as anomaly detection, along with a graph stream source and certain user-defined requirements (e.g., accuracy > 70%). Kaixis perpetually computes tunable streaming graph embedding vectors, which are subsequently forwarded as inputs to high-stakes ML tasks. The output of the system consists of (i) real-time results for the high-stakes tasks, and (ii) quality-of-service (QoS) metrics, as depicted in Figure 1. Kaixis addresses two types of users: end-users and developers. End-users interact with the system through the Graph ML Library to specify their ML task and QoS requirements. Developers use the flexible Embedding Primitives to define tunable streaming embedding algorithms. In a nutshell, after users submit their queries, Kaixis follows concrete steps. Namely, the Stateful Walker constantly explores the evolving graph and keeps the stored random walks up-to-date. Subsequently, the Embedding Trainer updates the existing embedding vectors on-the-fly, based on the walks received from the Stateful Walker. Then, the Task Trainer uses the embeddings to produce query results. The Training Monitor has the power to switch either the Embedding or the Task Trainer on and off. It selectively enables retraining of existing embeddings and/or ML task models only when needed, to avoid excessive computation. In the following, we detail each component.
Graph ML Library. End-users interact with Kaixis via this library, which is a collection of possibly pre-configured algorithmic operators that can be invoked directly, without necessitating hyper-parameter tuning. These operators solve tasks such as link prediction, anomaly detection, fraud detection, outlier detection, graph reconstruction, and vertex classification.

Embedding Primitives. Developers interact with Kaixis through the Embedding Primitives, which enable them to easily implement, integrate, and deploy streaming embedding methods without having to worry about how distribution is handled by Kaixis. Additionally, any optimized implementation of a primitive speeds up every method that utilizes that primitive. Hence, the Embedding Primitives unify optimizations across distinct embedding methods.
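As a thought experiment, the primitive set could take roughly the following shape (a hypothetical sketch of ours, not the system's actual API): a developer supplies a handful of callbacks, and the system decides where and when they run in the cluster.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class EmbeddingPrimitives:
    """Hypothetical primitive set (for illustration only): developers fill in
    these callbacks; distribution is handled behind the scenes."""
    next_step: Callable[[int, List[int]], int]         # walk transition rule
    affected: Callable[[Tuple[int, int]], List[int]]   # nodes to revisit per edge
    train_step: Callable[[List[List[int]]], None]      # consume revised walks

# A uniform first-order walker expressed against these primitives:
uniform_walker = EmbeddingPrimitives(
    next_step=lambda cur, nbrs: random.choice(nbrs),
    affected=lambda edge: list(edge),
    train_step=lambda walks: None,   # plug in any incremental trainer here
)
```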
Stateful Walker. Random walks are core concepts of graph embeddings. The Stateful Walker in Kaixis consists of two parts: the Random Walker and the Walk Storage. The former constantly explores the ingested graph stream and produces new random walks for newly appearing nodes, or updates the walks attributed to already existing nodes. The Walk Storage unit is responsible for efficiently storing the latest random walk corpus.
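The following minimal sketch conveys the intended behavior (it is illustrative, not the actual design, and uses a deliberately naive update policy):

```python
import random

class StatefulWalker:
    """Illustrative sketch: an adjacency map plus the current walk corpus;
    only the walks of affected nodes are revised on each update."""

    def __init__(self, walks_per_node=10, walk_len=40, seed=42):
        self.adj = {}      # Random Walker's view of the streamed graph
        self.walks = {}    # Walk Storage: node -> its latest walks
        self.wpn, self.wl = walks_per_node, walk_len
        self.rnd = random.Random(seed)

    def _walk_from(self, v):
        walk, cur = [v], v
        while len(walk) < self.wl and self.adj[cur]:
            cur = self.rnd.choice(self.adj[cur])
            walk.append(cur)
        return walk

    def insert_edge(self, u, v):
        for x in (u, v):
            self.adj.setdefault(x, [])
        self.adj[u].append(v)
        self.adj[v].append(u)
        # Naive affected-set policy: re-walk only the two endpoints; walks
        # of untouched nodes stay as they are (cf. challenge 1 in Sec. 4.1).
        for x in (u, v):
            self.walks[x] = [self._walk_from(x) for _ in range(self.wpn)]
        return self.walks[u] + self.walks[v]   # forwarded to the trainer
```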
Embedding Trainer. As shown in Figure 1, the produced random walks (either newly formed or new parts of modified ones) are forwarded to the Embedding Trainer, which refines and outputs the latest embedding vectors by conducting incremental training. In doing so, the trainer hosts a variety of training algorithms in its arsenal, e.g., online Skip-Gram models trained with stochastic gradient descent.
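A minimal sketch of such an incremental step follows (our own simplification of Skip-Gram with negative sampling; hyper-parameters are illustrative). The point is that each step touches only the vectors of the nodes in the incoming walk, plus a few sampled negatives:

```python
import numpy as np

def sgns_step(emb, ctx, walk, window=2, neg=3, lr=0.025, rng=None):
    """One incremental Skip-Gram step with negative sampling: the cost
    depends on the walk length, not on the size of the graph."""
    rng = rng or np.random.default_rng()
    n = emb.shape[0]
    for i, u in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        for v in walk[lo:i] + walk[i + 1:hi]:
            pairs = [(v, 1.0)] + [(int(rng.integers(n)), 0.0) for _ in range(neg)]
            for w, label in pairs:
                p = 1.0 / (1.0 + np.exp(-emb[u] @ ctx[w]))
                g = lr * (label - p)
                du = g * ctx[w]          # compute both gradients first,
                ctx[w] += g * emb[u]     # then apply them
                emb[u] += du

rng = np.random.default_rng(0)
emb = rng.normal(scale=0.1, size=(100, 16))
ctx = rng.normal(scale=0.1, size=(100, 16))
sgns_step(emb, ctx, [3, 7, 2, 9], rng=rng)   # refine only these nodes
```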
Task Trainer. The user's selected ML application, e.g., anomaly detection, has to be executed in a continuous fashion. We thus opt for online ML algorithms, where the training of the model is performed in an online fashion, similarly to the serving part. Kaixis deploys the Task Trainer to update the model of the specified ML task on-the-fly and finally output the prediction results.
Training Monitor. Both the Embedding and the Task Trainer perform online training. However, it is important to note that performing online training blindly, i.e., updating the model after every single change in the streaming graph, may lead to unnecessary excessive computation and thus degrade the system's performance. Specifically, the update frequency of the Task Trainer should be high enough to satisfy the user-defined requirements. The frequency of the Embedding Trainer should be driven by the downstream (possibly critical) ML tasks, such that the embeddings are kept up-to-date. Thus, the embedding training is not everlasting but is tuned in real time by the Training Monitor. For instance, if the user wants to detect outliers in an evolving graph, the embedding vectors should be updated just as frequently as is necessary for capturing all outliers instantly. In other words, if an update to an embedding vector does not yield any change in the prediction results, it should not be performed, so as to avoid unnecessary computation.
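One conceivable trigger, sketched below with names and a policy of our own choosing, retrains only after the reported accuracy has stayed below the user-defined requirement for several consecutive QoS reports:

```python
def should_retrain(accuracy_history, requirement=0.70, patience=3):
    """Sketch of one possible trigger: fire only if accuracy stayed below
    the user-defined requirement for `patience` consecutive QoS reports."""
    recent = accuracy_history[-patience:]
    return len(recent) == patience and all(a < requirement for a in recent)

# The monitor leaves the Embedding Trainer switched off here...
assert not should_retrain([0.81, 0.68, 0.74])
# ...and switches it on after a sustained drop:
assert should_retrain([0.69, 0.66, 0.64])
```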
Results & QoS. This component serves as a reporting unit, gathering the results and QoS metrics of the ML task from the Task Trainer. The QoS metrics consist of accuracy metrics, e.g., area under the curve (AUC) and micro-F1 score, and performance measurements, such as throughput and execution time.

4. THE RESEARCH ROAD AHEAD

Our goal is to serve dynamic applications that can leverage graph embeddings for retrieving information from an evolving graph. In essence, Kaixis extracts embeddings from a graph stream in a tunable way and feeds them to downstream ML tasks. The system can operate at the finest granularity, i.e., always derive the latest vectors and train the latest ML model of a task. However, its profound goal is actually to avoid excessive training and instead to continuously adjust the retraining frequency by monitoring the QoS metrics.
4.1 Research Challenges

Realizing Kaixis is far from straightforward. Below, we highlight five research challenges:

(1) Streaming Random Walks. Maintaining random walks on an evolving graph should not require computing all walks from scratch after a graph update, but only revising the already kept walk corpus. Most importantly, the refined walk corpus at time t + 1 should be statistically equivalent to a corpus computed from scratch on the graph at time t + 1, i.e., the updated corpus should have the same probability of being produced as a corpus derived by totally recalculating all the walks. Different policies can be used for updating walks [10], but clearly much remains to be done, both on the theoretical side, for coming up with sound policies, and on the performance side, via distribution.
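To illustrate, here is one candidate policy, simplified and in the spirit of [10] (the helper and its parameters are ours): the first time a walk visits a vertex whose neighborhood changed, its suffix is resampled from the updated graph.

```python
import random

def refresh_walk(walk, adj, changed, rnd=random):
    """Resample the suffix of a walk from the first visit to a changed
    vertex, so that, in distribution, the corpus looks freshly drawn
    from the graph at time t + 1."""
    for i, v in enumerate(walk[:-1]):
        if v in changed:
            suffix, cur = [], v
            while i + 1 + len(suffix) < len(walk) and adj[cur]:
                cur = rnd.choice(adj[cur])
                suffix.append(cur)
            return walk[:i + 1] + suffix
    return walk  # unaffected walks keep their old, still-valid sample

# e.g., after inserting edge (2, 4):
adj = {0: [1], 1: [0, 2], 2: [1, 4], 3: [], 4: [2]}
print(refresh_walk([0, 1, 2, 1, 0], adj, changed={2, 4}))
```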
(2) Scalable Random Walks. The huge magnitude and high dynamicity of today's graph data render centralized random walk computation highly insufficient. Yang et al. [13] crafted a whole system solely dedicated to distributed random walk computation. Adapting their ideas to streaming graphs is far from trivial. One cannot afford to store the entire graph stream in a streaming setting. In addition, one should carefully distribute the graph on-the-fly to facilitate the walk calculation by cluster nodes while avoiding excessive communication. Network communication is costly; therefore, judicious streaming graph partitioning is needed to ensure extremely scalable streaming random walks.
(3) Walk Storage Sharing. In graph node embeddings, numerous random walks are created for each single vertex, resulting in walk sets with overlapping parts. Since Kaixis needs to maintain a random walk corpus, effective techniques that store overlapping walk parts only once are crucial. The real challenge is to come up with compression schemes for a succinct representation of the whole walk corpus. Ideally, the compression should be lossless and should enable processing the walks in their compressed form, without the need for deserialization. Finally, the streaming setting increases the complexity of the problem, as it dictates the possibility of updating the walks while in compressed form.
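As a simple starting point, a trie already shares common prefixes across walks; the sketch below (ours, illustrative) shows the idea, although it covers neither suffix sharing nor in-place streaming updates:

```python
class WalkTrie:
    """Prefix sharing: walks with a common prefix store it once. The scheme
    is lossless, and membership tests run directly on the compressed form,
    without deserializing individual walks."""

    def __init__(self):
        self.root = {}

    def insert(self, walk):
        node = self.root
        for v in walk:
            node = node.setdefault(v, {})
        node[None] = True                 # marks the end of a stored walk

    def contains(self, walk):
        node = self.root
        for v in walk:
            if v not in node:
                return False
            node = node[v]
        return None in node

trie = WalkTrie()
trie.insert([0, 1, 2, 3])
trie.insert([0, 1, 2, 4])                 # prefix 0-1-2 is stored once
assert trie.contains([0, 1, 2, 4]) and not trie.contains([0, 1, 3])
```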
(4) Monitoring. The Training Monitor is the brain of our conceived system: it tunes the frequency of the retraining performed by either the Embedding Trainer or the Task Trainer. A number of research questions arise, such as: (i) when should the Training Monitor trigger the Embedding and Task Trainers, e.g., periodically or based on a certain reasonable mechanism; (ii) what is the impact of a disabled trainer on the final ML task result; and (iii) how does the specific ML task chosen affect the monitor's decisions, i.e., high-stakes ML tasks would differ from non-critical ones. Clearly, the monitor's behaviour is driven by the downstream ML tasks.

(5) Primitives. To help users implement, integrate, and deploy streaming embedding methods, Kaixis should offer primitives that hide the implementation details of how distribution is handled. Designing such primitives is challenging, as it implies breaking various embedding algorithms down into "atoms". In Kaixis, primitives are executed as highly performant streaming and distributed random walk-based operations. These abstractions offer the potential for transparent optimizations, i.e., optimizations to a primitive used by many algorithms end up optimizing them all in one shot.
4.2 Research Plan

To conclude, we present our research strategy for tackling the aforementioned challenges and realizing Kaixis.

Streaming and Scalable Random Walks. The first step is to design the Stateful Walker for calculating streaming random walks using streaming graph partitioning and distribution. To assist this goal, we plan to use an efficient and succinct walk storage representation. Finally, we plan to establish the Stateful Walker theoretically by: (i) proving the statistical equivalence of walk update policies, and (ii) deriving complexity bounds for the walk update procedures.
Monitored Embedding Training. On the one hand, we plan to investigate relevant online training algorithms proposed in the literature (e.g., [4]) and thoroughly evaluate them to decide which ones to incorporate into the Embedding Trainer's arsenal. On the other hand, we plan to design the Training Monitor such that it configures the frequency of retraining/updating the embeddings driven by the importance of the downstream ML tasks; for high-stakes tasks, this frequency is higher. To achieve scalability, both the Embedding Trainer and the Training Monitor should be carefully designed to operate in a fully distributed manner, while also aiming to minimize the communication between them.
Powerful Primitives. Our perpetual goal along this endeavour is to devise expressive primitive abstractions that facilitate users. Finally, we strive to devise effective optimizations for each primitive, as the various methods employing such a primitive will benefit simultaneously.

Acknowledgments. The author would like to thank Prof. Volker Markl and Dr. Zoi Kaoudi for their pristine guidance, as well as Dr. Eleni Tzirita Zacharatou for her invaluable feedback. This work was funded by the German Ministry for Education and Research as BIFOLD - Berlin Institute for the Foundations of Learning and Data (ref. 01IS18025A and ref. 01IS18037A).

5. REFERENCES

[1] S. Cao, W. Lu, and Q. Xu. GraRep: Learning Graph Representations with Global Structural Information. In CIKM, pages 891–900, 2015.
[2] H. Chen, H. Yin, T. Chen, Q. V. H. Nguyen, W. Peng, and X. Li. Exploiting Centrality Information with Graph Convolutions for Network Representation Learning. In ICDE, pages 590–601, 2019.
[3] A. Grover and J. Leskovec. node2vec: Scalable Feature Learning for Networks. In KDD, pages 855–864, 2016.
[4] N. Kaji and H. Kobayashi. Incremental Skip-gram Model with Negative Sampling. In EMNLP, pages 363–371, 2017.
[5] A. Lerer, L. Wu, J. Shen, T. Lacroix, L. Wehrstedt, A. Bose, and A. Peysakhovich. PyTorch-BigGraph: A Large-scale Graph Embedding System. In SysML, pages 285–296, 2019.
[6] J. Li, H. Dani, X. Hu, J. Tang, Y. Chang, and H. Liu. Attributed Network Embedding for Learning in a Dynamic Environment. In CIKM, pages 387–396, 2017.
[7] X. Liu, P.-C. Hsieh, N. Duffield, R. Chen, M. Xie, and X. Wen. Real-Time Streaming Graph Embedding Through Local Actions. In WWW, pages 285–293, 2019.
[8] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, pages 3111–3119, 2013.
[9] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online Learning of Social Representations. In KDD, pages 701–710, 2014.
[10] H. P. Sajjad, A. Docherty, and Y. Tyshetskiy. Efficient Representation Learning Using Random Walks for Dynamic Graphs. CoRR, abs/1901.01346, 2019.
[11] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-Scale Information Network Embedding. In WWW, pages 1067–1077, 2015.
[12] D. Wang, P. Cui, and W. Zhu. Structural Deep Network Embedding. In KDD, pages 1225–1234, 2016.
[13] K. Yang, M. Zhang, K. Chen, X. Ma, Y. Bai, and Y. Jiang. KnightKing: A Fast Distributed Graph Random Walk Engine. In SOSP, pages 524–537, 2019.
[14] D. Zhang, X. Huang, Z. Liu, Z. Hu, X. Song, Z. Ge, Z. Zhang, L. Wang, J. Zhou, and Y. Qi. AGL: A Scalable System for Industrial-purpose Graph Machine Learning. arXiv preprint arXiv:2003.02454, 2020.
[15] R. Zhu, K. Zhao, H. Yang, W. Lin, C. Zhou, B. Ai, Y. Li, and J. Zhou. AliGraph: A Comprehensive Graph Neural Network Platform. PVLDB, 12(12):2094–2105, 2019.