=Paper=
{{Paper
|id=Vol-2652/paper10
|storemode=property
|title=Tunable Streaming Graph Embeddings at Scale
|pdfUrl=https://ceur-ws.org/Vol-2652/paper10.pdf
|volume=Vol-2652
|authors=Serafeim Papadias
|dblpUrl=https://dblp.org/rec/conf/vldb/Papadias20
}}
==Tunable Streaming Graph Embeddings at Scale==
Serafeim Papadias
Technische Universität Berlin
supervised by Prof. Volker Markl
s.papadias@tu-berlin.de
ABSTRACT

An increasing number of real-world applications require machine learning tasks over large-scale streaming graphs, where nodes and edges are continuously being added or deleted. Graph embeddings have been widely used for solving such tasks by capturing the graph structure and features into a low-dimensional latent space. However, current approaches have one or more of the following disadvantages: (i) they are designed for either static or dynamic graphs and thus need retraining after each graph change or periodic updating of the embeddings after each snapshot arrival, (ii) they fail to scale to today's graphs composed of billions of nodes, or (iii) those devised for streaming graphs perform redundant retraining computations by mandating continuous embedding updates even if the accuracy is not improved. The goal of this thesis is to overcome the above-mentioned problems by devising tunable streaming methods that can scale to massive graphs. We envision an end-to-end ML streaming system that achieves this goal and provides users with abstractions to easily define their own streaming embedding algorithms.

1. INTRODUCTION

Graphs are omnipresent in various domains such as social media, transportation, finance, IoT, and biological networks. Typically, real-world graphs are inherently dynamic, entailing continuous additions and deletions of vertices and edges. The frequency of these graph updates lies on a spectrum. In dynamic graphs, updates appear in batches as graph snapshots and are applied periodically. In streaming graphs, updates arrive spontaneously and are incorporated on-the-fly. For instance, social networks that model friendships between users are highly dynamic, while citation networks modelling relations among scholars are less dynamic. Many real-world applications, such as news recommendation or crime detection, can be modeled as machine learning (ML) tasks over streaming graphs. Characteristic tasks include vertex classification, link prediction, link reconstruction, topic modeling, and community, anomaly, fraud, and outlier detection.

A very popular and effective technique for solving such ML tasks is graph node embeddings (a.k.a. network representation learning). Given a graph G = (V, E) with n nodes, a graph embedding maps each node v ∈ G to a compact feature vector in a lower, k-dimensional space (k ≪ n), which captures the graph structure and properties in the vicinity of v. These vectors are derived by optimizing objective functions that preserve geometric relationships among graph nodes. Computing embeddings is a crucial problem in itself, as they serve as inputs to the downstream ML tasks mentioned above.
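To fix notation for what follows, the mapping just described can be stated compactly (a generic formulation; the concrete objective differs per method):

```latex
% A node embedding assigns each vertex a low-dimensional vector:
% f maps V into R^k with k much smaller than the number of nodes n,
% chosen to minimize a proximity-preserving loss over node pairs.
f : V \to \mathbb{R}^{k}, \qquad k \ll n = |V|
```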
The frequency of updating the embedding vectors plays a significant role in runtime performance and accuracy, and should be driven by the subsequent ML task. For example, critical downstream applications, such as anomaly detection, should react instantly to graph changes in order to capture all anomalies; hence, their input embeddings must constantly remain up-to-date to produce accurate results. For the above scenario, static embedding methods, such as [9], are unsuitable, as they need to retrain the embeddings from scratch after each graph change. Dynamic techniques, such as [6], are not sufficient either, as they update the embeddings periodically after each snapshot arrival and hence fall short of discovering anomalies that appear during the idle period between two graph snapshots. Even though streaming algorithms [7, 10] refine the embedding vectors after every graph update, this can be potentially unnecessary, as the graph structure may not be altered substantially enough to affect the accuracy of the downstream ML task. Thus, anomaly detection and similar continuous ML tasks dictate flexible streaming solutions that are tunable: they can adjust the frequency with which the embeddings are updated.

Nowadays, real-world graphs not only dictate (tunable) streaming algorithms due to their ever-changing nature, but also scalable solutions because of their massive size. However, the majority of existing embedding techniques, whether designed for static graphs [3, 9, 11, 12] or for dynamic graphs [6, 7, 10], are centralized; hence, they do not scale to massive graphs. Ideally, scalability should be achieved by distributing the processing over clusters of commodity machines, thus avoiding solutions that utilize expensive server machines with several terabytes of main memory or unaffordable GPUs. A few scalable graph embedding systems [5] exist; however, they are unable to operate on evolving graphs. Thus, there is a lack of solutions that can both achieve scalability and conduct streaming graph embedding processing.
As ML, and graph embeddings in particular, become more and more popular, there is a need for systems that facilitate the development of such algorithms by hiding the system complexity from the users. For instance, how the system distributes the processing of the defined algorithms should be transparent to the users. There are a few systems with this goal: they provide primitives for distributed random walk computation on static graphs [13] or for graph neural network computation [14, 15]. However, no such comprehensive system for (tunable) streaming graph embeddings exists.

In this thesis, we strive to: (i) devise tunable streaming methods that generate embeddings incrementally by adjusting the frequency of vector updates, (ii) build a scalable solution that processes streaming embedding algorithms in a distributed manner, and (iii) provide abstractions that ideally incorporate distinct classes of streaming embedding methods. Our goal is to synthesize these solutions into Kaixis¹, a novel end-to-end system capable of bridging the gap between streaming and distributed embedding methods.

¹ From the Turkish word Kayikçi; the owner of a fishing boat.

2. RELATED WORK

Representation learning on graphs has received huge attention from researchers in the past few years. There are three main categories of graph embeddings, based on: (a) random walks, (b) matrix factorization, and (c) deep learning. In what follows, we focus on the first category, as the other two incur extremely high costs that render them unsuitable for streaming scenarios, i.e., deep learning-based methods train directly on the whole graph and factorization-based methods suffer from expensive matrix operations [6]. We mainly review random walk-based (dynamic) network embedding algorithms and systems.

Walk-based Algorithms. Embedding methods based on random walks mainly consist of two phases: the random walking, which explores, samples, and captures certain properties of the graph, and the training, which subsequently ingests the produced random walks and trains the embedding vectors, e.g., using the well-known Skip-Gram [8] model. Depending on the random walk type, different properties are captured. Specifically, DeepWalk [9] performs truncated random walks to preserve first-order proximities in a graph, whereas node2vec [3] deploys biased second-order walks that capture second-order proximities. LINE [11] optimizes an objective function that preserves both first-order and second-order proximities while deriving the embedding vectors. GraRep [1] captures higher-order proximities in the final embedding vectors. GraphCSC [2] deploys centrality-based walks capable of learning embeddings that preserve graph characteristics, such as degree and betweenness, and finally aggregates them into one vector. Nevertheless, all the techniques above are designed for static graphs; hence, they are unable to adapt to the streaming scenarios that we focus on.
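To illustrate the first phase, the sketch below generates DeepWalk-style truncated random walks over an adjacency-list graph; the resulting corpus would then be fed to a Skip-Gram-style trainer. This is a minimal illustration of the general technique, not code from any of the cited systems; the toy graph and parameter values are invented.

```python
import random

def truncated_random_walks(adj, walks_per_node=10, walk_length=40, seed=42):
    """Generate DeepWalk-style truncated random walks.

    adj: dict mapping each node to a list of its neighbors.
    Returns a list of walks; each walk is a list of node ids and is the
    unit later consumed by a Skip-Gram-style trainer.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(adj)
        rng.shuffle(nodes)              # randomize start order in each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:       # dead end: truncate the walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy usage on a 4-node graph:
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
corpus = truncated_random_walks(adj, walks_per_node=2, walk_length=5)
print(len(corpus), corpus[0])
```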
Interestingly enough, only [7, 10] address the problem of streaming graph embeddings. In [7], a rather ad-hoc influence propagation model is used for locating nodes whose embedding vector is influenced by a graph change (either an addition or a deletion). Also, a vertex stream is assumed, i.e., each new node arrives along with its complete adjacency list, which is a limitation. In [10], a method that incrementally maintains first- and second-order random walks in a streaming graph is proposed; however, it is centralized and lacks theoretical guarantees of correctness.

Walk-based embedding algorithms fall short mainly in two crucial aspects. First, they incur computational costs proportional to the number of deployed walks, rendering them insufficient for massive graphs, especially the widespread centralized solutions. Instead, massively parallel and ideally cost-effective solutions, such as distributing tasks over clusters of commodity machines, are required. Secondly, there is no well-established streaming walk-based technique with a robust theoretical analysis. In contrast, we aim for (tunable) streaming and distributed walk-based embeddings that are also theoretically established.

Walk-based Systems. KnightKing [13] is a distributed system dedicated to computing random walks on static graphs. It offers a walker-centric computation model, which is able to express various walk algorithms. Its extreme efficiency is attributed to the rejection sampling it utilizes, especially when computing cumbersome higher-order walks. However, KnightKing is incapable of computing streaming random walks, as it requires the whole graph in advance. Beyond academia, industry has also shown huge interest in scalable systems for graph neural networks (GNNs) [14, 15], which are relevant to graph embeddings. AliGraph [15] provides primitive operators that abstract common concepts among GNN algorithms, facilitating users in implementing and deploying them. AGL [14] builds upon k-hop neighborhoods² and utilizes MapReduce and parameter servers to speed up GNN training; a relatively straightforward task due to the linear nature of GNNs. As far as we know, there is no comprehensive system that provides primitives for streaming embedding algorithms.

² For vertex v, the set of vertices reachable from v within at most k steps.

3. KAIXIS ARCHITECTURE

In this section we introduce Kaixis, our envisioned end-to-end system for computing streaming random walk-based graph embeddings on the fly and at scale. Figure 1 shows the system architecture, consisting of seven main components: a Graph ML Library, the Embedding Primitives, a Stateful Walker, an Embedding Trainer, a Task Trainer, a Training Monitor, and the Results & QoS module.

[Figure 1: System architecture of Kaixis, showing the Graph ML Library, Embedding Primitives, Stateful Walker (Random Walker, Walk Storage), Embedding Trainer, Task Trainer, Training Monitor, and QoS & Results modules.]

The input of the system consists of continuous ML tasks, such as anomaly detection, along with a graph stream source and certain user-defined requirements (e.g., accuracy > 70%). Kaixis perpetually computes tunable streaming graph embedding vectors, which serve as inputs to high-stakes ML tasks.
The output of the system consists of (i) real-time results for the high-stakes tasks, and (ii) quality-of-service (QoS) metrics, as depicted in Figure 1. Kaixis addresses two types of users: end-users and developers. End-users interact with the system through the Graph ML Library to specify their ML task and QoS requirements. Developers use the flexible Embedding Primitives to define tunable streaming embedding algorithms. In a nutshell, after users submit their queries, Kaixis follows concrete steps. Namely, the Stateful Walker constantly explores the evolving graph and keeps the stored random walks up-to-date. Subsequently, the Embedding Trainer updates the existing embedding vectors on-the-fly, based on the walks received from the Stateful Walker. Then, the Task Trainer uses the embeddings to produce query results. The Training Monitor has the power to switch either the Embedding Trainer or the Task Trainer on and off. It selectively enables retraining of the existing embeddings and/or the ML task models only when needed, to avoid excessive computation. In the following, we detail each component.

Graph ML Library. End-users interact with Kaixis via this library, which is a collection of possibly pre-configured algorithmic operators that can be invoked directly without necessitating hyper-parameter tuning. These operators solve tasks such as link prediction, anomaly detection, fraud detection, outlier detection, graph reconstruction, and vertex classification.

Embedding Primitives. Developers interact with Kaixis through the Embedding Primitives, which enable them to easily implement, integrate, and deploy streaming embedding methods without having to worry about how the distribution is handled by Kaixis. Additionally, any optimized implementation of a primitive also speeds up every method that utilizes this primitive. Hence, the Embedding Primitives unify optimizations across distinct embedding methods.
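The paper does not pin down the primitives' interface; purely as an illustration of the intended developer experience, such an abstraction could look like the following sketch, in which every name is hypothetical:

```python
from abc import ABC, abstractmethod

class StreamingEmbeddingAlgorithm(ABC):
    """Hypothetical developer-facing interface (all names invented for
    illustration, not Kaixis' actual API). Kaixis would own distribution;
    a developer only supplies the per-update logic below."""

    @abstractmethod
    def on_graph_update(self, update):
        """React to one streamed edge/vertex addition or deletion,
        e.g., by marking vertices whose walks must be revised."""

    @abstractmethod
    def update_walks(self, affected_vertices):
        """Incrementally revise the stored walk corpus."""

    @abstractmethod
    def train_step(self, revised_walks):
        """Refine the embedding vectors from the revised walks only."""
```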
Stateful Walker. Random walks are core concepts of graph embeddings. A Stateful Walker in Kaixis consists of two parts: the Random Walker and the Walk Storage. The former constantly explores the ingested graph stream and produces new random walks for newly appearing nodes or updates the walks attributed to already existing nodes. The Walk Storage unit is responsible for efficiently storing the latest random walk corpus.
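One conceivable update policy, sketched below in the spirit of incremental walk maintenance (cf. [10]) but not taken from any cited system: when an edge (u, v) arrives, every stored walk visiting u or v is re-simulated from its first visit of the changed vertex onward, since the transition distribution there has changed. All structure and names are illustrative.

```python
import random

class StatefulWalker:
    """Minimal sketch of incremental walk maintenance under edge
    insertions (one possible policy, not Kaixis' implementation)."""

    def __init__(self, walk_length=40, seed=7):
        self.adj = {}            # node -> list of neighbors
        self.walks = {}          # walk id -> list of nodes
        self.visits = {}         # node -> set of walk ids visiting it
        self.walk_length = walk_length
        self.rng = random.Random(seed)

    def start_walk(self, wid, start):
        """Create a fresh walk rooted at `start` (e.g., a new vertex)."""
        self.walks[wid] = [start]
        self.visits.setdefault(start, set()).add(wid)
        self._rewalk_from(wid, start)

    def insert_edge(self, u, v):
        self.adj.setdefault(u, []).append(v)
        self.adj.setdefault(v, []).append(u)
        # The transition distributions at u and v changed, so walks
        # visiting them are re-simulated from that point onward.
        for node in (u, v):
            for wid in list(self.visits.get(node, ())):
                self._rewalk_from(wid, node)

    def _rewalk_from(self, wid, node):
        walk = self.walks[wid]
        i = walk.index(node)             # first visit of the changed node
        prefix = set(walk[: i + 1])
        for n in walk[i + 1:]:           # drop the stale suffix
            if n not in prefix:
                self.visits[n].discard(wid)
        del walk[i + 1:]
        while len(walk) < self.walk_length and self.adj.get(walk[-1]):
            nxt = self.rng.choice(self.adj[walk[-1]])
            walk.append(nxt)
            self.visits.setdefault(nxt, set()).add(wid)
```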
Embedding Trainer. As shown in Figure 1, the produced random walks (either newly formed ones or the new parts of modified ones) are forwarded to the Embedding Trainer, which refines and outputs the latest embedding vectors by conducting incremental training. In doing so, the trainer hosts a variety of training algorithms in its artillery, e.g., online Skip-Gram and Stochastic Gradient Descent models.
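For intuition, one SGD step of Skip-Gram with negative sampling over a single (center, context) pair drawn from a walk could look as follows. This is the textbook update (cf. [8]; incremental variants in [4]), not the trainer's actual code; all parameter values are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, center, context, negatives, lr=0.025):
    """One SGD step of Skip-Gram with negative sampling.

    W_in, W_out: (n, k) input/output embedding matrices, updated in place.
    center, context: node ids forming a positive pair from a random walk.
    negatives: node ids sampled as negative examples.
    """
    v = W_in[center]
    grad_v = np.zeros_like(v)
    for node, label in [(context, 1.0)] + [(neg, 0.0) for neg in negatives]:
        u = W_out[node].copy()                     # snapshot before updating
        g = lr * (label - sigmoid(np.dot(v, u)))   # scaled error term
        grad_v += g * u
        W_out[node] += g * v
    W_in[center] += grad_v

# Toy usage: 5 nodes embedded in 8 dimensions.
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(5, 8))
W_out = np.zeros((5, 8))
sgns_step(W_in, W_out, center=0, context=1, negatives=[2, 3])
```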
Task Trainer. The user's selected ML application, e.g., anomaly detection, has to be executed in a continuous way. We thus opt for online ML algorithms, where the training of the model is performed in an online fashion, similarly to the serving part. Kaixis deploys the Task Trainer to update the model of the specified ML task on-the-fly and finally output the prediction results.

Training Monitor. Both the Embedding Trainer and the Task Trainer perform online training. However, it is important to note that performing online training blindly, i.e., updating the model after every single change in the streaming graph, may lead to unnecessary excessive computation and thus degrade the system's performance. Specifically, the frequency of the Task Trainer should be large enough to satisfy the user-defined requirements. The frequency of the Embedding Trainer should be driven by the downstream (possibly critical) ML tasks, such that the embeddings are kept up-to-date. Thus, the embedding training is not everlasting but is tuned in real-time by the Training Monitor. For instance, if the user wants to detect outliers in an evolving graph, the embedding vectors should be updated just as frequently as is necessary for capturing all outliers instantly. In other words, if an update to an embedding vector does not yield any change in the prediction results, it should not be performed, to avoid unnecessary computation.
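One conceivable gating rule, purely for illustration (the thesis deliberately leaves the exact mechanism open; see Challenge 4 below): retrain only while the monitored QoS metric drifts or falls below the user-defined target. All names and thresholds here are invented.

```python
class TrainingMonitor:
    """Sketch of a QoS-driven gate for the trainers (illustrative policy,
    not Kaixis' actual mechanism)."""

    def __init__(self, target_accuracy=0.70, tolerance=0.01):
        self.target = target_accuracy     # user-defined requirement
        self.tolerance = tolerance        # drift considered significant
        self.last_accuracy = None

    def should_retrain(self, current_accuracy):
        if self.last_accuracy is None:    # no history yet: train
            self.last_accuracy = current_accuracy
            return True
        drifted = abs(current_accuracy - self.last_accuracy) > self.tolerance
        below_target = current_accuracy < self.target
        self.last_accuracy = current_accuracy
        return drifted or below_target    # otherwise skip the update

monitor = TrainingMonitor()
for acc in [0.65, 0.72, 0.72, 0.71, 0.60]:
    print(acc, monitor.should_retrain(acc))
```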
Results & QoS. This component serves as a reporting unit, gathering the results and QoS metrics of the ML task from the Task Trainer. QoS consists of accuracy metrics, e.g., area under the curve (AUC) and micro-F1 score, and performance measurements, such as throughput and execution time.

4. THE RESEARCH ROAD AHEAD

Our goal is to serve dynamic applications that can leverage graph embeddings for retrieving information from an evolving graph. In essence, Kaixis extracts embeddings from a graph stream in a tunable way and feeds them to downstream ML tasks. The system can operate at the finest granularity, i.e., always derive the latest vectors and train the latest ML model of a task. However, its profound goal is actually to avoid excessive training and instead strive to continuously adjust the retraining frequency by monitoring the QoS metrics.

4.1 Research Challenges

Realizing Kaixis is far from straightforward. Below, we highlight five research challenges:

(1) Streaming Random Walks. Maintenance of random walks on evolving graphs should not compute all walks from scratch after a graph update, but only revise the already kept walk corpus. Most importantly, the refined walk corpus at time t + 1 should be statistically equivalent to a corpus computed from scratch at time t + 1: the updated corpus should have the same probability of being produced as one derived by totally recalculating all the walks. Different policies can be used for updating walks [10], but clearly much remains to be done, both on the theoretical side, for coming up with sound policies, and on the performance side, via distribution.
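Stated formally (our phrasing of the requirement above, not a theorem from the paper): writing W^inc for the incrementally revised corpus and W^full for a corpus recomputed from scratch on the graph at time t + 1, a sound update policy must guarantee

```latex
% Statistical equivalence of incremental walk maintenance:
% the revised corpus is distributed exactly like a from-scratch one.
\Pr\!\left[\mathcal{W}^{\mathrm{inc}}_{t+1} = W\right]
  \;=\;
\Pr\!\left[\mathcal{W}^{\mathrm{full}}_{t+1} = W\right]
\qquad \text{for every walk corpus } W.
```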
(2) Scalable Random Walks. The huge magnitude and the high dynamicity of today's graph data render centralized random walk calculation highly insufficient. Yang et al. [13] crafted a whole system solely dedicated to distributed random walk computation. Adapting their ideas to streaming graphs is far from trivial. One cannot afford to store the entire graph stream in a streaming setting. In addition, one should carefully distribute the graph on-the-fly to facilitate the walk calculation by the cluster nodes, while avoiding excessive communication. Network communication is too costly; therefore, acute streaming graph partitioning should ensure extremely scalable streaming random walks.

(3) Walk Storage Sharing. In graph node embeddings, numerous random walks are created for each single vertex, resulting in walk sets with overlapping parts. Since Kaixis needs to maintain a random walk corpus, effective techniques that store overlapping walk parts only once are crucial. The real challenge is to come up with compression schemes for a succinct representation of the whole walk corpus. Ideally, the compression should be lossless and enable processing of the walks in their compressed form, without the need for deserialization. Finally, the streaming setting increases the complexity of the problem, as it dictates the possibility of updating the walks while in compressed form.
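To see why sharing pays off, consider a prefix tree that stores walks with a common prefix only once. The sketch below is a deliberately naive possibility; the challenge above calls for far more ambitious schemes that also share suffixes and support in-place updates.

```python
class WalkTrie:
    """Naive prefix-sharing store for a walk corpus: walks with a common
    prefix share trie nodes (illustrative only)."""

    def __init__(self):
        self.root = {}            # node id -> child dict
        self.count = 0            # trie nodes allocated so far

    def add_walk(self, walk):
        cur = self.root
        for v in walk:
            if v not in cur:
                cur[v] = {}
                self.count += 1   # only unshared parts allocate storage
            cur = cur[v]

store = WalkTrie()
store.add_walk([0, 1, 2, 3])
store.add_walk([0, 1, 2, 4])      # shares the prefix 0-1-2
print(store.count)                # 5 trie nodes instead of 8
```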
(4) Monitoring. The Training Monitor is the brain of our conceived system: it tunes the frequency of retraining performed by either the Embedding Trainer or the Task Trainer. A number of research questions arise, such as: (i) when should the Training Monitor trigger the Embedding and Task Trainers, e.g., periodically or based on a certain reasonable mechanism, (ii) what is the impact of a disabled trainer on the final ML task result, and (iii) how does each specific ML task chosen affect the monitor's decisions, i.e., high-stakes ML tasks would differ from non-critical ones. Clearly, the monitor's behaviour is driven by the downstream ML tasks.

(5) Primitives. To facilitate users in implementing, integrating, and deploying streaming embedding methods, Kaixis should offer primitives that hide the implementation details of how distribution is handled. Designing such primitives is challenging, as it implies breaking various embedding algorithms down into "atoms". In Kaixis, primitives are executed as extremely performant streaming and distributed random walk-based operations. These abstractions offer the potential for transparent optimizations, i.e., optimizations to a primitive used by many algorithms end up optimizing them all in one shot.

4.2 Research Plan

To conclude, we present our research strategy for tackling the aforementioned challenges and realizing Kaixis.

Streaming and Scalable Random Walks. The first step is to design the Stateful Walker for calculating streaming random walks using streaming graph partitioning and distribution. To assist our goal, we plan to use an efficient and succinct walk storage representation. Finally, we plan to establish the Stateful Walker theoretically by: (i) proving the statistical equivalence of walk update policies, and (ii) deriving complexity bounds for the procedures updating the walks.

Monitored Embedding Training. On the one hand, we plan to investigate relevant online training algorithms proposed in the literature (e.g., [4]) and thoroughly evaluate them to decide which ones to incorporate into the Embedding Trainer's artillery. On the other hand, we plan to design the Training Monitor such that it configures the frequency of retraining/updating the embeddings, driven by the importance of the downstream ML tasks; for high-stakes tasks, this frequency is larger. To achieve scalability, both the Embedding Trainer and the Training Monitor should be carefully conceived to operate in a fully distributed manner, also aiming to minimize the communication between them.

Powerful Primitives. Our perpetual goal along this endeavour is to conjure expressive primitive abstractions that facilitate users. Finally, we strive to devise effective optimizations for each primitive, as the various methods enclosing such a primitive will benefit simultaneously.

Acknowledgments. The author would like to thank Prof. Volker Markl and Dr. Zoi Kaoudi for their pristine guidance, as well as Dr. Eleni Tzirita Zacharatou for her invaluable feedback. This work was funded by the German Ministry for Education and Research as BIFOLD - Berlin Institute for the Foundations of Learning and Data (ref. 01IS18025A and ref. 01IS18037A).

5. REFERENCES

[1] S. Cao, W. Lu, and Q. Xu. GraRep: Learning Graph Representations with Global Structural Information. In CIKM, pages 891–900, 2015.
[2] H. Chen, H. Yin, T. Chen, Q. V. H. Nguyen, W. Peng, and X. Li. Exploiting Centrality Information with Graph Convolutions for Network Representation Learning. In ICDE, pages 590–601, 2019.
[3] A. Grover and J. Leskovec. node2vec: Scalable Feature Learning for Networks. In KDD, pages 855–864, 2016.
[4] N. Kaji and H. Kobayashi. Incremental Skip-gram Model with Negative Sampling. In EMNLP, pages 363–371, 2017.
[5] A. Lerer, L. Wu, J. Shen, T. Lacroix, L. Wehrstedt, A. Bose, and A. Peysakhovich. PyTorch-BigGraph: A Large-scale Graph Embedding System. In SysML, pages 285–296, 2019.
[6] J. Li, H. Dani, X. Hu, J. Tang, Y. Chang, and H. Liu. Attributed Network Embedding for Learning in a Dynamic Environment. In CIKM, pages 387–396, 2017.
[7] X. Liu, P.-C. Hsieh, N. Duffield, R. Chen, M. Xie, and X. Wen. Real-Time Streaming Graph Embedding Through Local Actions. In WWW, pages 285–293, 2019.
[8] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, pages 3111–3119, 2013.
[9] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online Learning of Social Representations. In KDD, pages 701–710, 2014.
[10] H. P. Sajjad, A. Docherty, and Y. Tyshetskiy. Efficient Representation Learning Using Random Walks for Dynamic Graphs. CoRR, abs/1901.01346, 2019.
[11] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-Scale Information Network Embedding. In WWW, pages 1067–1077, 2015.
[12] D. Wang, P. Cui, and W. Zhu. Structural Deep Network Embedding. In KDD, pages 1225–1234, 2016.
[13] K. Yang, M. Zhang, K. Chen, X. Ma, Y. Bai, and Y. Jiang. KnightKing: A Fast Distributed Graph Random Walk Engine. In SOSP, pages 524–537, 2019.
[14] D. Zhang, X. Huang, Z. Liu, Z. Hu, X. Song, Z. Ge, Z. Zhang, L. Wang, J. Zhou, and Y. Qi. AGL: A Scalable System for Industrial-purpose Graph Machine Learning. arXiv preprint arXiv:2003.02454, 2020.
[15] R. Zhu, K. Zhao, H. Yang, W. Lin, C. Zhou, B. Ai, Y. Li, and J. Zhou. AliGraph: A Comprehensive Graph Neural Network Platform. PVLDB, 12(12):2094–2105, 2019.