<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Non-continuous Distributed Graph Computing System using Persistent Memory</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miaomiao Cheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiujian Chen</string-name>
          <email>jiujian@bit.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cheng Zhao</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cheng Chen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yongmin Hu</string-name>
          <email>huyongmin@bytedance.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoliang Cong</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liang Qin</string-name>
          <email>qinliang.touyi@bytedance.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hexiang Lin</string-name>
          <email>linhexiang@bytedance.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Rong-Hua Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guoren Wang</string-name>
          <email>wanggr@bit.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lei Zhang</string-name>
          <email>zhanglei.michael@bytedance.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Beijing Institute of Technology</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Douyin Vision Co., Ltd</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
        </aff>
      </contrib-group>
      <abstract>
<p>Graph computing systems play a critical role in a variety of industrial applications. This study examines ByteDance's graph computing system workload, which challenges the conventional notion of a one-shot, lightweight graph computing task and can scale to trillions of edges. The workload includes both small- and large-scale tasks, separated by a 1000-second runtime threshold. The majority of the workload consists of small-scale tasks that are submitted arbitrarily but have highly time-sensitive requirements, while large-scale tasks make up the bulk of computing resources and occur periodically. Therefore, the graph computing system must be capable of pausing running tasks and prioritizing more critical ones. In this paper, we introduce ByteGAP, a non-continuous graph computing system that leverages PMEM's unique features, such as durability, byte-addressability, memory-like access, lower latency, and high capacity. The non-continuous approach uses checkpointing mechanisms to achieve effective fault detection and recovery. ByteGAP provides two key contributions: (1) lightweight distributed checkpointing based on PMEM, and (2) efficient dual-mode PMEM management for optimizing PMEM read and write operations. Moreover, we present a comprehensive evaluation that demonstrates the system's ability to handle the challenges associated with large-scale computing tasks. The findings lay the foundation for future research in distributed graph computing systems and advocate for a non-continuous approach to graph computing.</p>
      </abstract>
      <kwd-group>
        <kwd>graph</kwd>
        <kwd>non-continuous graph processing</kwd>
        <kwd>persistent memory</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Graph computing systems have been widely used in industry [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ], catering to a variety of application scenarios. In ByteDance, thousands of graph computing tasks run daily in various scenarios such as recommendations, fraud detection, and content auditing. To better understand the characteristics of real-world graph computing workloads, we examined the tasks on ByteDance's graph computing platform.
      </p>
      <sec id="sec-1-1">
        <title>Joint Workshops at 49th International Conference on Very</title>
      </sec>
      <sec id="sec-1-2">
        <title>Large Data Bases (VLDBW’23) —Workshop on Accelerat</title>
      <p>Figure 1: Percentage of the number of tasks and resources (running time multiplied by used core numbers) occupied for different task-scales in ByteDance graph computing workloads. Tasks are separated by a runtime threshold of 1000 seconds.</p>
      <p>As Figure 1 shows, the workload is dominated by small-scale tasks submitted arbitrarily but with highly time-sensitive requirements, while large-scale tasks occur periodically and make up the bulk of computing resources. Large-scale tasks also incur significant memory consumption: for large-scale graphs, the number of edges can reach a trillion-scale, and at least 10TB of memory is required to store the pure graph data structure. These characteristics present challenges:
• Potential other-task blockage due to lack of pause capability. Large-scale graph computing tasks cannot be paused, which may result in blocking other tasks when preemptive strategies are not in place [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ].
• Lower tolerance for timeliness. Graph tasks must meet application timeliness requirements, such as daily updated search tasks. For small-scale tasks, timeliness can be ensured through a retry when encountering execution failures. However, large-scale tasks have higher retry costs, and when cluster failures occur during computation [7, 8], it may be impossible to satisfy the task's timeliness requirement.</p>
      <p>Therefore, to address these challenges, graph computing systems must support checkpoint mechanisms and implement effective non-continuous graph computing techniques to ensure high availability within the graph computing cluster. A non-continuous graph computing system supports effective fault detection and recovery strategies and allows high-priority tasks to interrupt low-priority ones. This approach offers significant advantages for large-scale graph computing systems.</p>
      <p>Large-scale graph computing workloads require a large amount of storage space, with DRAM delivering exceptional performance and SSDs providing persistent checkpoint storage and extending DRAM. In addition, PMEM boasts unique characteristics, such as durability, byte-addressability, memory-like access, lower latency, and high capacity [9, 10], rendering it a better choice for constructing checkpoint-centric graph computing systems. However, deploying large-scale graph computation on PMEM presents a challenge. To ensure data consistency, graph computation requests that write operations follow a certain order [11, 12]. However, keeping the writes to PMEM connected via the memory bus in a certain order requires a special set of CPU instructions such as CLFLUSHOPT or CLWB followed by SFENCE, which has been proven to be extremely expensive [13, 14, 15].</p>
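      <p>To make this ordering cost concrete, the following is a minimal sketch of an ordered persist on x86, assuming the CLWB/SFENCE intrinsics from immintrin.h (compiled with -mclwb); the persist_range/write_record names are illustrative stand-ins, not ByteGAP's API.</p>
      <preformat><![CDATA[
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Write back every cache line in [addr, addr + len) to PMEM, then fence.
// CLWB flushes a line without evicting it; SFENCE orders these flushes
// before any later store can become persistent.
static void persist_range(const void* addr, size_t len) {
    constexpr uintptr_t kLine = 64;
    uintptr_t p = reinterpret_cast<uintptr_t>(addr) & ~(kLine - 1);
    const uintptr_t end = reinterpret_cast<uintptr_t>(addr) + len;
    for (; p < end; p += kLine) _mm_clwb(reinterpret_cast<void*>(p));
    _mm_sfence();
}

// Ordered write: persist the payload before the validity flag, so a crash
// can never expose a flag that points at unpersisted data.
struct Record { uint64_t payload; uint64_t valid; };

void write_record(Record* r, uint64_t value) {
    r->payload = value;
    persist_range(&r->payload, sizeof r->payload);  // order point (expensive)
    r->valid = 1;
    persist_range(&r->valid, sizeof r->valid);
}
]]></preformat>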
      <p>In this paper, we propose ByteGAP, a non-continuous distributed graph computing system based on PMEM. To handle ByteDance-style graph computing workloads, we suggest a non-continuous strategy that preserves checkpoints for graph topology structures and vertex states during the iterative computing process. To increase efficiency, we further specialize the design of PMEM checkpoints. To efficiently store topology structures and vertex states from various graph computing algorithms, we design a dual-mode PMEM management strategy that supports both fixed and variable data allocation. In summary, the main contributions of this paper are as follows.
• Lightweight distributed checkpointing. ByteGAP supports efficient, lightweight distributed checkpoints based on PMEM to facilitate effective non-continuous large-scale graph computation, tailored to graph computation by considering both graph topology structure and vertex state data. We have developed a checkpoint saving mechanism using PMEM to improve system reliability and performance.
• Efficient dual-mode PMEM management. To accommodate multiple data patterns in the system, ByteGAP integrates a carefully-designed dual-mode persistent memory manager to handle and optimize PMEM read and write operations. To achieve high data access performance, the persistent memory manager employs an innovative dual-mode PMEM allocator design, integrating both pool-based and log-based PMEM allocators to support fixed-size and non-fixed-size data storage, respectively.</p>
      <p>The rest of this paper is organized as follows. §2 describes the overall system architecture. We discuss the PMEM-based checkpoint-centric system design in §3. Experimental evaluation is reported in §4, and §5 reviews the related works. At last, we conclude our work in §6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. System Architecture</title>
      <p>Figure 2: The system architecture of ByteGAP.</p>
      <p>We depict the architecture of ByteGAP in Figure 2, omitting some trivial connections between components and data for readability. The system can be seen as consisting of four parts (i.e., Application, CP-Agent, Computing Kernel and Persistent Memory Manager). Each node in the ByteGAP cluster will deploy only one process to perform graph computation while fully occupying all CPU resources.</p>
      <p>The global runtime environment is managed by the CP-Agent. We build agents to spawn, rendezvous, and monitor workers across all nodes. They play a crucial role in achieving the checkpointing for our system. Beyond the system layer, user-defined graph algorithms are written as an Application by specifying and injecting exposed interfaces. The Compute() method serves as the kernel for vertex-centric graph processing. It is defined by the users to implement the specific computation logic over each vertex along with its neighbors, following the classic 'think like a vertex' programming paradigm, based on various distributed graph algorithms. In addition, there are two optional functions that users can customize. The Combine() method is used to combine messages targeting the same destination vertex locally, reducing communication overhead across nodes. The Aggregate() method periodically aggregates global intermediate results/statistics across nodes during computation, acting as a signal for global algorithm behaviors or system controls.</p>
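      <p>A minimal sketch of what such an Application can look like, using PageRank: only the method names Compute(), Combine(), and Aggregate() come from ByteGAP's interface description; the surrounding types (VertexContext, the Mailer callback) are our own illustrative stand-ins.</p>
      <preformat><![CDATA[
#include <cstdint>
#include <vector>

struct VertexContext {
    int64_t id;
    double value;                        // mutable vertex value
    std::vector<int64_t> out_neighbors;  // read-only topology
};

class PageRankApp {
public:
    // Invoked once per active vertex per superstep ("think like a vertex").
    template <class Mailer>
    void Compute(VertexContext& v, const std::vector<double>& msgs, Mailer send) {
        double sum = 0.0;
        for (double m : msgs) sum += m;
        v.value = 0.15 + 0.85 * sum;
        if (v.out_neighbors.empty()) return;
        double share = v.value / v.out_neighbors.size();
        for (int64_t dst : v.out_neighbors) send(dst, share);
    }

    // Optional: merge two messages bound for the same destination vertex.
    double Combine(double a, double b) { return a + b; }

    // Optional: fold a per-node statistic into a global one each superstep,
    // e.g. the total rank change, to drive a convergence check.
    double Aggregate(double global_delta, double local_delta) {
        return global_delta + local_delta;
    }
};
]]></preformat>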
      <p>Computing Kernel is the core part of ByteGAP. It executes applications based on the BSP model, using user-defined methods in iterations. We further divide the worker into several components according to their functions. The Sender and the Receiver drive the tasks between computation and communication, including combining messages and constructing the Message Table in the Persistent Memory Manager. Batch Processing represents a collection of computing operations that we will discuss in §3.1. It uses multiple threads, invoking the computing methods, generating messages, and updating the values of vertices. Moreover, all threads used in the processing and communication are managed by a Thread Pool. To perform a graph algorithm computation, workers first use the Graph Loader to load the graph from the Hadoop Distributed File System (HDFS) or local file system, where users can self-define the input data format of vertices and edges in the files. Loaded data will be partitioned among distributed workers and finally saved on DRAM for the Vertex Table and persistent memory for the Edge Table respectively, with contiguous storage to obtain the performance benefits of sequential read/write. The algorithm is conducted in rounds of supersteps, where every worker processes the computation in batches and shuffles messages to others in each superstep. When the last superstep is finished, the Result Dumper will write the results back to the storage devices in a user-defined format and terminate the program.</p>
      <p>The underlying part is the Persistent Memory Manager (§3.3), which handles access to PMEM devices and persistent data management (including data storage, indexing, memory allocation, and garbage collection). Benefiting from the byte-addressability and good read/write performance of PMEM, we leverage it to store massive data in both the Edge/Message Tables and checkpoints. For the consideration of checkpointing, the Checkpoint Manager (§3.2) will generate and manage checkpoints at certain supersteps. When entering a failure recovery process, workers will use these checkpoints to restore themselves and continue distributed computing. We heavily engineered our system to be adapted for persistent memory, to achieve the best system performance with checkpointing.</p>
    </sec>
    <sec id="sec-3">
      <title>3. System Design</title>
      <p>We first introduce the computing kernel of ByteGAP in §3.1, which is the essential part of the system's execution. Then, we discuss how to achieve fast recovery with checkpoints in §3.2, and how to store and manage data in persistent memory in §3.3.</p>
      <sec id="sec-3-1">
        <title>3.1. Computing Kernel</title>
        <p>In this section, we introduce the basic data layout and computing module of ByteGAP.</p>
        <p>Figure 3: The data layout of ByteGAP (Vertex Table with ID/Value/Active/Offset columns, Edge Table, and Message Table).</p>
        <p>Page-based data layout. CSR is a widely used format for storing sparse graphs and matrices in graph systems [11, 16, 17]. We employ CSR to store graph data for the benefit of cache prefetching and efficient data access. Figure 3 depicts the data layout format of ByteGAP. All data associated with vertices is stored as a table in memory for efficient updating and fast access. Typically, each vertex takes one row in the array, storing the vertex ID, the vertex value, the active state, and an offset to the start position of its adjacency-list in the Edge Page. Each active state and vertex value is modifiable at run-time for algorithm processing. To adapt the CSR format to PMEM, we organize the adjacency lists of all vertices into a global linked page-list (i.e., the Edge Table), where each page has a fixed length to store a fixed number of edges. Then, the offsets of vertex[i] and vertex[i+1] can clearly indicate the start and end position (i.e., page number and offset on the page) of the adjacency-list of vertex[i]. Accordingly, to fetch the neighbors of one given vertex, we can iterate the pages of the Edge Table by sequentially reading on PMEM, which delivers performance (i.e., latency and throughput) comparable to DRAM.</p>
        <p>In addition, we also need to iterate over the messages belonging to each vertex during the computation. Each message includes a source vertex ID and message data. Similar to the Edge Table, we arrange all message data into a global list of linked pages on PMEM, called the Message Table.</p>
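        <p>The following sketch shows how such a neighbor scan can work; the field names follow Figure 3, but the page size, the sentinel row, and the struct layout are our assumptions, not ByteGAP's published layout.</p>
        <preformat><![CDATA[
#include <cstdint>
#include <vector>

constexpr size_t kEdgesPerPage = 512;           // fixed-length edge pages

struct VertexRow {                              // one Vertex Table row (DRAM)
    int64_t  id;
    double   value;
    bool     active;
    uint64_t offset;  // global edge index = page number * kEdgesPerPage + slot
};

struct EdgePage { int64_t dst[kEdgesPerPage]; };// node of the linked page-list (PMEM)

// Iterate the adjacency list of vertex i: the offsets of row i and row i+1
// bound its edges inside the global page-list, so the scan is sequential.
// A sentinel row at the end of the table provides the final bound.
template <class F>
void for_each_neighbor(const std::vector<VertexRow>& vertices,
                       const std::vector<EdgePage*>& edge_pages, size_t i, F f) {
    const uint64_t begin = vertices[i].offset;
    const uint64_t end   = vertices[i + 1].offset;
    for (uint64_t e = begin; e < end; ++e) {
        const EdgePage* page = edge_pages[e / kEdgesPerPage];
        f(page->dst[e % kEdgesPerPage]);
    }
}
]]></preformat>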
        <p>Message passing. While computing, a vertex can send messages to any other vertices to be processed in the next superstep. Message passing between vertices may occur across different workers. Two components of ByteGAP, the Sender and the Receiver, are used to manage this process between workers. Alongside the compute dispatcher, the Sender grabs messages from compute threads and sends them out once the message count reaches a batch size. Meanwhile, the Receiver accepts these streaming messages and saves them to the Message Table. It is expected that communication costs will overlap with computation in such pipelines.</p>
        <p>To reduce the total amount of messages processed in each superstep, we take message properties-based approaches in different scenarios. When multiple messages sent to a vertex can be combined, a user-defined Combiner is adopted in both the Sender and the Receiver. In one sending batch, messages with the same destination will be combined. And after being received, messages will also be merged into one final message while generating the Message Table. Therefore, the indices of batches of messages can be easily calculated from the offsets in the Message Table for the computing dispatchers in the next superstep.</p>
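        <p>A sketch of sender-side combining under stated assumptions: messages are doubles, the combiner is an arbitrary binary function, and the send batch holds one slot per destination; the SendBatch/Flush names are ours, not ByteGAP's.</p>
        <preformat><![CDATA[
#include <cstdint>
#include <functional>
#include <unordered_map>

class SendBatch {
public:
    SendBatch(size_t batch_size, std::function<double(double, double)> combine)
        : batch_size_(batch_size), combine_(std::move(combine)) {}

    void Add(int64_t dst, double msg) {
        auto [it, fresh] = pending_.try_emplace(dst, msg);
        if (!fresh) it->second = combine_(it->second, msg); // combine locally
        if (pending_.size() >= batch_size_) Flush();
    }

    void Flush() {
        for (auto& [dst, msg] : pending_) {
            // Here the real system would ship (dst, msg) to dst's worker,
            // where the Receiver merges it again into the Message Table.
            (void)dst; (void)msg;
        }
        pending_.clear();
    }

private:
    size_t batch_size_;
    std::function<double(double, double)> combine_;
    std::unordered_map<int64_t, double> pending_;  // one slot per destination
};
]]></preformat>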
        <p>We also use mirror vertices to reduce communication costs when each vertex sends the same message to its neighbors. A Remote Message Table is built alongside the resulting Message Table. When a vertex needs to send an identical message to multiple neighbors on a node, there will be a mirror vertex in the Remote Message Table on the target node. Message passing only occurs between the original vertex and its mirrors. Neighbors then fetch messages from the mirrors to fill the Message Table for the next superstep. For ease of programming, mirror vertices are transparent to algorithms, and these improvements are optional to reduce communication costs. We also optimize for vertices with huge degrees. They will have mirror vertices on each node, and the messages passing to them are first gathered locally before being sent out. To prevent concurrent write conflicts on such mirrors, we collect messages within each thread and combine them upon completion of each superstep. Based on the skewed distribution of graph vertices, we dynamically select the threshold for degrees, trading mirror memory usage for reduced global communication costs.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Lightweight Distributed Checkpointing</title>
        <p>To facilitate the recovery of our computing workers, we construct checkpoints at a periodic number of supersteps (which we call CP supersteps). For ease of access, the Checkpoint Manager saves checkpoints as key-value pairs in the PMEM store for fast retrieval after a process restart. The keys are generated based on both the identifier of each system component and the current system superstep number, where the values are the specific states or computing data that need to be stored inside checkpoints.</p>
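        <p>As a sketch of this key scheme: only "component identifier plus superstep number" is from the design; the exact key encoding and the put/get store interface below are our assumptions.</p>
        <preformat><![CDATA[
#include <cstdint>
#include <string>

// Hypothetical key layout: component identifier + current superstep.
std::string checkpoint_key(const std::string& component, uint32_t superstep) {
    return component + "@" + std::to_string(superstep);
}

// Usage against an abstract PMEM key-value store (put/get assumed):
//   store.put(checkpoint_key("vertex_table:worker3", 20), vertex_bytes);
//   store.put(checkpoint_key("msg_buffer_index:worker3", 20), index_bytes);
// After a restart, a worker re-issues the same keys for the superstep
// named by the leader's Last Checkpoint ID (LCI) and restores itself.
]]></preformat>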
        <p>Leader-follower checkpointing. Checkpointing occurs at the beginning of each CP superstep; then each worker enters a specific routine to construct its checkpoint individually. These checkpoints exist on the local persistent memory of each node, and we discuss their format in §3.3. Note that to ensure the integrity of the newly generated checkpoint before safely removing the previous checkpoint, all nodes should set up a global barrier before the atomic checkpoint swap. In addition, we assign one node to be the worker leader and use it to generate the system's unique Last Checkpoint ID (LCI). The LCI is a basic concept of the checkpoint mechanism, indicating what the last global CP superstep is. We only allow the leader to generate and store the LCI and broadcast it to the other followers during recovery, to avoid an inconsistent distribution, considering that one worker may fail at checkpointing while the others successfully save the new checkpoint. Therefore, at the end of the checkpointing routine, each follower sends a completion notification to the leader after its checkpoints have persisted. And only after receiving all completion messages can the leader update the LCI and acknowledge it to the followers by notifying them that the previous checkpoint is out of date. When the CP-Agents relaunch all workers, the leader will load the LCI from its persistent layer and broadcast it to all followers to restore the checkpointed data.</p>
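        <p>The commit protocol can be summarized by the sketch below; barrier() and the other helpers are no-op stand-ins for ByteGAP's real communication layer, which is not published. The single persisted LCI update acts as the atomic point of commitment.</p>
        <preformat><![CDATA[
#include <cstdint>

// No-op stand-ins for the real cluster primitives.
inline void barrier() {}                              // global superstep barrier
inline void write_local_checkpoint(uint32_t) {}
inline void wait_for_all_follower_acks() {}           // followers persisted data
inline void persist_lci(uint32_t) {}                  // one small durable write
inline void broadcast_expire_previous(uint32_t) {}

// Leader-side routine at a CP superstep.
void leader_checkpoint_round(uint32_t superstep) {
    barrier();                            // every worker reached this superstep
    write_local_checkpoint(superstep);    // leader checkpoints like any worker
    wait_for_all_follower_acks();         // each follower acks after persisting
    persist_lci(superstep);               // commit: LCI now names this round
    broadcast_expire_previous(superstep); // old checkpoint is now reclaimable
}
// If any follower fails before all acks arrive, the LCI is never advanced,
// so recovery falls back to the previous, still-intact checkpoint.
]]></preformat>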
        <p>Lightweight checkpoints. After loading the graph, topologies, i.e., edges and partition information, are involved in every computation and message sending. They will not change during the computation of many algorithms, so only one checkpoint for them is required before the first superstep of computing. And at each superstep, as described in §3.1, the computing dispatchers need the updated states and the messages received from local vertices in the last superstep, which are required to be saved in a checkpoint at each CP superstep. We adopt a lightweight checkpointing approach that is dedicated to reducing this extra overhead, based on whether the data is mutable or not and which memory it mainly uses. While edges are usually immutable and persist only once at the beginning, vertex-related data, i.e., the Vertex Table and the Message Table, take up the bulk of the workload when checkpointing.</p>
        <p>For vertices stored in DRAM, we serialize them down to the PMEM layer in parallel to make full use of PMEM bandwidth at minimum cost. The message checkpoint is thoughtfully considered in other works [18, 19], as its size can be proportional to the number of edges. However, we have already placed messages in PMEM as the Message Table before storing checkpoints. We only need to treat them as checkpoints and keep them available until the next checkpoint is generated.</p>
        <p>Specifically, we use three buffers in the Message Table to store received messages. One buffer is used for regular supersteps; it is overwritten by incoming messages iteratively at each superstep. The other two buffers work interchangeably across checkpointing rounds, which ensures that the latest checkpointed messages are always available even as we write the next checkpoint. We only need to store an index to track which buffer holds the latest messages at each CP superstep. Therefore, the overall checkpointing overhead is lightweight due to the spread of data in our system. And during recovery, we only need to restore the indices of the key-value pairs in persistent memory, as we describe below.</p>
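        <p>A minimal sketch of the triple-buffer scheme under our naming (buf[0] as the per-superstep scratch buffer, buf[1]/buf[2] alternating as checkpoint targets):</p>
        <preformat><![CDATA[
#include <array>
#include <cstdint>
#include <vector>

struct MessageBuffers {
    std::array<std::vector<uint8_t>, 3> buf; // [0] scratch, [1]/[2] checkpoints
    uint32_t latest = 1;  // which of buf[1]/buf[2] holds the last good checkpoint

    std::vector<uint8_t>& scratch() { return buf[0]; } // rewritten each superstep

    // At a CP superstep: copy into the *other* checkpoint buffer, then flip.
    // The previous checkpoint stays readable the whole time; persisting the
    // small `latest` index is what commits the new one.
    void checkpoint_messages() {
        uint32_t next = (latest == 1) ? 2u : 1u;
        buf[next] = buf[0];
        latest = next;    // recovery restores from buf[latest]
    }
};
]]></preformat>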
        <p>Checkpointing with agent. Inspired by Torch Distributed Elastic [20], which enables PyTorch to run in a fault-tolerant manner, we implement our CP-Agent to manage the distributed computing workers in ByteGAP, to assemble the essential environment for communication, and to monitor the running status of each computing process. It is commonly assumed that as cluster size increases, the chance of the entire system failing due to the failure of a single worker increases as well. Therefore, how to recover a distributed graph computing system becomes a primary concern in our design. The basic idea is to re-spawn working processes with the help of the CP-Agent, where the CP-Agent ensures that the entire program can be fully executed or terminated normally. If some processes encounter fatal errors during the computation, the CP-Agent takes charge of the recovery of process ranks, the communication environment, and worker monitoring.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Dual-Mode Persistent Memory Management</title>
        <p>We use the Persistent Memory Manager (PMM) to manage ByteGAP's persistent memory space for high-performance PMEM access. The PMM mainly supports data storage for the Edge Table, the Message Table, and checkpoints. We adopt pages of a fixed length as the minimum allocation unit for PMEM. A series of pages can form the Edge Table and the Message Table as a linked page-list. To facilitate optimal data access and allocation, we set the page size to be the OS physical page size. Since checkpointing data can be arbitrary in length, the PMM also needs to support dynamic memory allocation. Further, as we mentioned in §3.2, the PMM should organize stored elements as key-value pairs, to better categorize different checkpointed components for their on- and off-loading.</p>
        <p>Figure 4: The dual-mode PMEM allocator: (a) pool-based allocator; (b) log-based allocator.</p>
        <p>Dual-mode allocator. For the above reasons, we adopt both pool-based and log-based allocators for fixed and non-fixed data storage, respectively, to achieve efficient data management (i.e., allocation, deletion, garbage collection) on PMEM. To meet fixed-size memory allocation requirements, the pool-based allocator implements a shared memory management strategy as follows. As illustrated in Figure 4a, when a thread attempts to allocate an object, it first searches its own private object cache for a free object. If no free object is left, it fetches free objects from a global free-list to refill its local object cache. Only if there are not enough free objects in the global free-list does the thread allocate a new chunk from the PMEM device, using this new chunk to generate free objects to refill the object cache. When a thread wants to deallocate an object, it first sets the tombstone bit of the deleted object and then adds the object to its private object cache. If the thread finds too many free objects (e.g., twice the bulk load size) in the object cache, it moves some free objects from its private free-list to the global free-list. Such a strategy provides the pool-based allocator with excellent data locality and high performance for data allocation.</p>
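        <p>The following is a minimal shape of that pool-based path, assuming illustrative object/chunk sizes and using the heap as a stand-in for a mapped PMEM chunk; none of the identifiers are from ByteGAP's code.</p>
        <preformat><![CDATA[
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

struct Pool {
    static constexpr size_t kObjSize = 64, kObjsPerChunk = 1024, kBulk = 32;

    std::vector<void*> global_free;   // shared free-list
    std::mutex mu;

    struct ThreadCache { std::vector<void*> free; }; // private per thread

    void* Allocate(ThreadCache& tc) {
        if (tc.free.empty()) Refill(tc);          // miss: go to the global list
        void* obj = tc.free.back();
        tc.free.pop_back();
        return obj;
    }

    void Deallocate(ThreadCache& tc, void* obj) {
        // The real system first sets a tombstone bit inside *obj on PMEM.
        tc.free.push_back(obj);
        if (tc.free.size() > 2 * kBulk) {         // cache overflow: give back
            std::lock_guard<std::mutex> g(mu);
            while (tc.free.size() > kBulk) {
                global_free.push_back(tc.free.back());
                tc.free.pop_back();
            }
        }
    }

    void Refill(ThreadCache& tc) {
        std::lock_guard<std::mutex> g(mu);
        if (global_free.size() < kBulk) NewChunk(); // carve a fresh chunk
        for (size_t i = 0; i < kBulk && !global_free.empty(); ++i) {
            tc.free.push_back(global_free.back());
            global_free.pop_back();
        }
    }

    void NewChunk() {
        // Stand-in for mapping a new chunk from the PMEM device.
        char* chunk = static_cast<char*>(::operator new(kObjSize * kObjsPerChunk));
        for (size_t i = 0; i < kObjsPerChunk; ++i)
            global_free.push_back(chunk + i * kObjSize);
    }
};
]]></preformat>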
        <p>For the log-based allocator depicted in Figure 4b, it manages the memory space in chunks, where each chunk stores variable-length records in a log-structured fashion and new records are appended at the tail. Deleted records gradually fragment a chunk. When the allocator finds a highly fragmented chunk, it first scans the chunk to locate all of its valid records, and then copies those valid records into a new chunk to seek a tighter data arrangement. Next, the allocator maps the memory accesses of the migrated records to their new memory addresses, and finally reclaims the memory space of the old chunk.</p>
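        <p>A sketch of that compaction step, under simplified Record/Chunk stand-ins; remap() models updating whatever indirection callers use to reach a record, which the paper does not detail.</p>
        <preformat><![CDATA[
#include <cstddef>
#include <cstdint>
#include <vector>

struct Record { uint64_t id; bool tombstone; std::vector<uint8_t> bytes; };
struct Chunk  { std::vector<Record> records; };

// Copy live records out of a fragmented chunk, then reclaim it.
void compact(Chunk& frag, Chunk& fresh,
             void (*remap)(uint64_t id, const Chunk* c, size_t slot)) {
    for (Record& r : frag.records) {
        if (r.tombstone) continue;          // dead: dropped by compaction
        fresh.records.push_back(r);         // live: copied tightly together
        remap(r.id, &fresh, fresh.records.size() - 1);
    }
    frag.records.clear();                   // old chunk space reclaimed
}
]]></preformat>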
        <p>Persistent data lifetime. For persistent data stored in the PMM belonging to different system components, we should use different types to manage their lifetime at runtime. For example, when a new checkpoint is generated, the old checkpoint does not need to be maintained. Yet deleting an outdated checkpoint is not an atomic operation: a failure may occur during the asynchronous deletion, and the CP-Agent will restart all workers after that. Then, as the PMM recovers by scanning the PMEM devices, we can directly drop the useless records and retain the useful ones based on their lifetime type. Therefore, we classify all data managed by the PMM into three lifetime types: Keep, Delete, and Exclusively Keep.</p>
        <p>The PMM supports data recovery when workers perform unexpected exits. Note that we reduce the recovery overhead by using PMEM runtime storage. Unlike other systems [
          <xref ref-type="bibr" rid="ref1 ref21 ref22 ref19">1, 21, 22, 19</xref>
          ], which recover the Message Table via recomputation or by reloading from a disk log, we directly store and fetch the Message Table in PMEM at runtime without any further operations.</p>
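        <p>Under our reading, the three lifetime types can be applied during the recovery scan roughly as follows; in particular, the rule that Exclusively Keep ties a record to the committed LCI round is our interpretation, not stated in the text.</p>
        <preformat><![CDATA[
#include <cstdint>

enum class Lifetime : uint8_t {
    kKeep,            // always restored on recovery (e.g., graph topology)
    kDelete,          // always dropped (e.g., half-deleted, outdated records)
    kExclusivelyKeep, // kept only for the committed checkpoint round
};

// Decision applied to each record found while scanning the PMEM devices
// after the CP-Agent restarts the workers.
bool keep_on_recovery(Lifetime t, uint32_t record_superstep, uint32_t lci) {
    switch (t) {
        case Lifetime::kKeep:            return true;
        case Lifetime::kDelete:          return false;
        case Lifetime::kExclusivelyKeep: return record_superstep == lci;
    }
    return false;
}
]]></preformat>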
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>This section reports our evaluation of ByteGAP, covering the overhead and recovery performance of its non-continuous checkpointing, the scalability of the system, and comparisons with out-of-core graph computing systems.</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>Testbed. We evaluated ByteGAP on a cluster of up to 16 machines, each equipped with two Intel Xeon Platinum 8260 CPUs (48 cores in total) and 128GB of DRAM. Each node also contains 320GB of SSD and 512GB of Optane DC PMM, which is configured in Memory mode or App Direct mode depending on the situation in our evaluation.</p>
        <p>Datasets. We adopt four real-world datasets for our evaluation. Table 1 shows the statistics of these four graphs: Twitter and Friendster from SNAP [23], and UK-2007 and UK-union from LAW [24]. Twitter and Friendster belong to the social networking domain, while UK-2007 and UK-union are both Internet graphs. The number of edges in each graph reaches the billion scale, and the degrees of the Internet graphs are larger than those of the social graphs. In addition, we generated a weighted graph from each dataset by randomly assigning an integer from 0 to 100 to each edge.</p>
        <p>Algorithms. We evaluated our system with four representative algorithms: PageRank, Connected Components (CC), Breadth-First Search (BFS) and Single-Source Shortest Path (SSSP). We ran these algorithms against other out-of-core systems for comparison. And since PageRank is implemented in all systems, we use it as a core test in several experiments.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Checkpointing Evaluation</title>
        <p>We examined the pros and cons of checkpointing in ByteGAP. To evaluate the elapsed time of checkpointing and recovery, we ran PageRank on the Twitter dataset and configured the system to write checkpoints at every 10th superstep. In the superstep after saving a checkpoint, one of the computing kernels kills itself to simulate a worker stopping. The PageRank algorithm ensures that each superstep has the same computation workload and generates the same number of messages. After the worker stops, the CP-Agent will relaunch and guide the system to begin recovery. There are four important phases in failure and recovery. We represent the time metrics of this process as follows:
• T_norm: the time of each superstep without saving checkpoints, used to indicate the elapsed time of normal execution for comparison with the checkpointing overhead. In our settings, this is the average time of ten supersteps.
• T_0 and T_cp: the time to write all needed checkpoints when the worker executes normally. We use T_0 to denote the time of the initial checkpointing before the first superstep, and T_cp as the time of saving checkpoints at the latest CP superstep.
• T_rec: the time of recovery from checkpoints. As our agents also kill the surviving workers, this covers the time for all relaunched workers to restore from their checkpoints.</p>
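        <p>In LaTeX form, the metrics can be summarized as follows (a sketch: the symbols T_norm, T_0, T_cp, and T_rec stand in for the paper's notation, and the T_rec identity is the subtraction used for the GraphX measurement below):</p>
        <preformat><![CDATA[
\begin{align*}
T_{\mathrm{norm}} &= \frac{1}{10}\sum_{k=1}^{10} t_k
  && \text{average superstep time without checkpointing} \\
T_0 &\;\text{: initial checkpoint (including topology) write time} \\
T_{\mathrm{cp}} &\;\text{: checkpoint write time at the latest CP superstep} \\
T_{\mathrm{rec}} &= T_{\text{failed-and-recovered run}} - T_{\text{normal run}}
  && \text{recovery time}
\end{align*}
]]></preformat>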
        <p>To fully evaluate performance, we recorded these time metrics for ByteGAP on 2 to 10 machines, as shown in Figures 5a-5c. There is one CP-Agent on each machine to keep the PMM and the Computing Kernel healthy, and each worker only saves checkpoints to, and loads them back from, its local storage. Figure 5a depicts the time metrics for the first two phases described above. The time T_0 of writing the initial checkpoint occurs only once in the whole running time. It does not scale as well as the other metrics because it includes the additional cost of saving the topology for each graph. However, the time of saving checkpoints at a CP superstep is always much smaller than normal execution. As the number of machines increases, the average execution time required for a superstep decreases proportionally. And T_cp also decreases, staying at about one-third of T_norm.</p>
        <p>We also ran ByteGAP in a testbed where the PMM uses a normal path on SSD as the underlying storage. Although naturally supported in our system, such a use case lacks the specialized design for PMEM that improves I/O efficiency. Figure 5b shows that checkpoints were written faster on SSDs than on PMEM. This is because the operating system has optimizations for SSD writes, including caching, which PMEM cannot currently use. Figure 5c illustrates the recovery time on different numbers of machines. Each of the reported times decreases exponentially as the number of machines grows. With a few machines, T_rec when running on PMEM is much lower than when running on SSDs. However, when more machines are involved, both perform at close metrics. Despite hardware limitations, ByteGAP benefits from PMEM features and does not lose performance when using SSDs on multiple machines.</p>
        <p>We also evaluated ByteGAP versus GraphX (Spark 3.0) on 2 to 10 PMEM-equipped servers to compare checkpoint performance. GraphX is a widely used graph processing system. For GraphX, we killed processes during computation to simulate task failure and recovery. T_rec is obtained by subtracting the time of normally running tasks from that of failed and recovered tasks. T_norm is likewise represented by the average time of the first 10 iterations. Since GraphX's checkpointing time T_cp involves synchronous copies and is difficult to measure, this experiment did not statistically analyze T_cp but mainly focused on measuring T_rec. On 10 machines, we found that the average time required for a single iteration of GraphX is 27.1 seconds, while T_rec of GraphX takes 23.5 seconds. The average time required for a single iteration of ByteGAP is only 4.3 seconds, while T_rec takes only 2.7 seconds. ByteGAP achieved much shorter T_norm and T_rec than GraphX. The T_rec comparison from 2 to 10 machines is shown in Figure 5d; ByteGAP is much faster than GraphX for all numbers of machines. Because GraphX uses Spark's checkpoint function, only a portion of the computing resources is used right after recovery, so when there are many computing nodes, the entire resources cannot be used immediately after recovery, which can lead to resource waste.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Scalability</title>
        <p>We examine the scalability of ByteGAP in this section to evaluate the relationship between execution time and cluster size. Intuitively, processing performance can be improved by increasing the number of machines. Since persistent memory can be configured in Memory mode or App Direct mode, we evaluated scalability by the running time in both configurations with up to 10 machines. Specifically, we adopted two datasets, Twitter and UK-union, and used 2, 4, 6, 8, and 10 machines to run the PageRank and CC algorithms in the two modes respectively. Figure 6 reports the results of our evaluation. As depicted, Memory mode and App Direct mode take different amounts of time. In almost all experiments, ByteGAP using App Direct mode delivered a smaller execution time than Memory mode with the same algorithm and the same number of machines. For the case of computing connected components on the Twitter dataset with 8 machines, it takes 25.19s in App Direct mode and 55.29s in Memory mode. This indicates two reasons: (1) the data storage we designed is more suitable for I/O with direct access, and (2) using Memory mode may cause memory swapping too frequently, reducing ByteGAP's performance.</p>
        <p>Our experiments verify that increasing the number of machines can reduce ByteGAP's execution time. The system benefits from more CPU resources and higher parallelism in both configurations as the number of machines increases. For example, when running PageRank on the Twitter dataset, Memory mode takes 279.64s on 2 machines and only 55.29s on 8 machines, and App Direct mode takes 204.8s on 2 machines and only 39.77s on 8 machines, showing that more machines need less time. This means that we can achieve better performance by adding more machines to improve parallelism. However, adding more machines to the cluster may introduce new overhead, such as system consistency guarantees, network communication overhead across machines, and hardware access overhead, all of which tend to degrade system performance. Experiments show that the performance of App Direct mode is worse than Memory mode at 10 machines when computing PageRank on Twitter and CC on UK-union, which also indicates that performance degradation occurs earlier in App Direct mode than in Memory mode. For example, running PageRank on the Twitter dataset in App Direct mode takes 69.59s with 6 machines and 39.77s with 8 machines, but 48.38s with 10 machines. However, this did not happen in Memory mode. As a result, I/O performance remains a bottleneck in the storage layer of the distributed system when computing resources are sufficient.</p>
        <p>Figure 8: ByteGAP vs. GraphChi on (a) PageRank and (b) CC.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Comparison with Out-of-Core Systems</title>
        <p>In this subsection, we compare the computing performance of ByteGAP with other out-of-core systems, because they have similar use cases with external storage. GraphChi [17] and GraphD [25] are well-known disk-based systems running in a single machine and in a distributed environment, respectively. We mounted the Optane PMM as an independent file system path in the OS and left all DRAM as the main memory, so that all systems are treated the same and configured consistently. While the other systems accept the edge-list file format, input graphs for GraphChi are preprocessed and stored directly on PMEM. We compare the execution time of ByteGAP with the above systems by running PageRank, CC, BFS, and SSSP on all datasets individually.</p>
        <p>We first report the execution time compared to GraphChi, as shown in Figure 8. There is a different emphasis in the design of single-machine and distributed systems, such as the shared-memory or shared-nothing architecture and the synchronization mechanism between supersteps. We sequentially ran ByteGAP from a single machine to multiple machines for comparison. To better measure the computing performance of the systems, we recorded the running time around the iterations of the graph algorithms only for each system. We start with the time of computing PageRank and connected components on one machine with ByteGAP and GraphChi, as depicted in Figure 8a and Figure 8b. Consequently, ByteGAP outperforms GraphChi on all datasets. The Parallel Sliding Windows approach for accessing disks in GraphChi ...</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Related Works</title>
      <p>
        Recently, many graph computing systems have been proposed following the vertex-centric programming model, which allows each vertex to access the messages received from the last superstep and to send new messages to its neighbors after computing the updated value. Following Google's Pregel system [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], many Pregel-like open-source systems have been proposed, including Giraph [26], Pregel+ [27], GraphX [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and so on. Like Pregel, these systems maintain the entire graph in main memory during the computing procedure. GraM [28] is a new system architecture that exploits the benefits of multi-core and RDMA for communication.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Malewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Austern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J. C.</given-names>
            <surname>Bik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Dehnert</surname>
          </string-name>
          , I. Horn,
          <string-name>
            <given-names>N.</given-names>
            <surname>Leiser</surname>
          </string-name>
          , G. Czajkowski,
          <article-title>Pregel: a system for large-scale graph processing</article-title>
          ,
          <source>in: SIGMOD</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Crankshaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Stoica</surname>
          </string-name>
          ,
          <article-title>Graphx: Graph processing in a distributed dataflow framework</article-title>
          ,
          <source>in: OSDI</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>599</fpage>
          -
          <lpage>613</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Jiang,</surname>
          </string-name>
          <article-title>GRAPE: parallelizing sequential graph computations</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>10</volume>
          (
          <year>2017</year>
          )
          <fpage>1889</fpage>
          -
          <lpage>1892</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Vavilapalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Murthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Douglas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Konar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seth</surname>
          </string-name>
          , et al.,
          <article-title>Apache hadoop yarn: Yet another resource negotiator</article-title>
          ,
          <source>in: Proceedings of the 4th annual Symposium on Cloud Computing</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Rejiba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chamanara</surname>
          </string-name>
          ,
          <article-title>Custom scheduling in kubernetes: A survey on common problems and solution approaches</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2022</year>
          ). URL: https://doi.org/10.1145/3544788. doi:10.1145/3544788.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sahni</surname>
          </string-name>
          ,
          <article-title>Preemptive scheduling of uniform processor systems</article-title>
          ,
          <source>Journal of the ACM (JACM) 25</source>
          (
          <year>1978</year>
          )
          <fpage>92</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Eisenman, K. K. Matam, S. Ingram, D. Mudigere, R. Krishnamoorthi, K. Nair, M. Smelyanskiy, M. Annavaram, Check-n-run: a checkpointing system for training deep learning recommendation models, in: 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), 2022, pp. 929-943.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] T. D. Chandra, S. Toueg, Unreliable failure detectors for reliable distributed systems, Journal of the ACM (JACM) 43 (1996) 225-267.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] L. Benson, L. Papke, T. Rabl, PerMA-bench: Benchmarking persistent memory access, Proc. VLDB Endow. 15 (2022) 2463-2476. URL: https://www.vldb.org/pvldb/vol15/p2463-benson.pdf.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] S. Gugnani, A. Kashyap, X. Lu, Understanding the idiosyncrasies of real persistent memory, Proc. VLDB Endow. 14 (2020) 626-639. URL: http://www.vldb.org/pvldb/vol14/p626-gugnani.pdf. doi:10.14778/3436905.3436921.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] X. Zhu, W. Chen, W. Zheng, X. Ma, Gemini: A computation-centric distributed graph processing system, in: OSDI, 2016, pp. 301-316.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J. Shun, G. E. Blelloch, Ligra: a lightweight graph processing framework for shared memory, in: A. Nicolau, X. Shen, S. P. Amarasinghe, R. W. Vuduc (Eds.), ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, Shenzhen, China, February 23-27, 2013, ACM, 2013, pp. 135-146. URL: https://doi.org/10.1145/2442516.2442530. doi:10.1145/2442516.2442530.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] C. Chen, J. Yang, M. Lu, T. Wang, Z. Zheng, Y. Chen, W. Dai, B. He, W.-F. Wong, G. Wu, Y. Zhao, A. Rudoff, Optimizing in-memory database engine for ai-powered on-line decision augmentation using persistent memory, Proc. VLDB Endow. 14 (2021) 799-812. URL: https://doi.org/10.14778/3446095.3446102. doi:10.14778/3446095.3446102.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] L. Benson, H. Makait, T. Rabl, Viper: An efficient hybrid pmem-dram key-value store, Proc. VLDB Endow. 14 (2021) 1544-1556. URL: http://www.vldb.org/pvldb/vol14/p1544-benson.pdf. doi:10.14778/3461535.3461543.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Y. Chen, Y. Lu, F. Yang, Q. Wang, Y. Wang, J. Shu, Flatstore: An efficient log-structured key-value storage engine for persistent memory, in: J. R. Larus, L. Ceze, K. Strauss (Eds.), ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16-20, 2020, ACM, 2020, pp. 1077-1091. URL: https://doi.org/10.1145/3373376.3378515. doi:10.1145/3373376.3378515.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. M. Hellerstein, Graphlab: A new framework for parallel machine learning, CoRR abs/1408.2041 (2014).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. Kyrola, G. E. Blelloch, C. Guestrin, Graphchi: Large-scale graph computation on just a PC, in: OSDI, 2012, pp. 31-46.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Y. Shen, G. Chen, H. V. Jagadish, W. Lu, B. C. Ooi, B. M. Tudor, Fast failure recovery in distributed graph processing systems, Proc. VLDB Endow. 8 (2014) 437-448.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] D. Yan, J. Cheng, H. Chen, C. Long, P. V. Bangalore, Lightweight fault tolerance in pregel-like systems, in: ICPP, 2019, pp. 69:1-69:10.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Torch distributed elastic, 2021. https://pytorch.org/docs/stable/distributed.elastic.html.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin, Powergraph: Distributed graph-parallel computation on natural graphs, in: OSDI, 2012, pp. 17-30.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] S. Salihoglu, J. Widom, GPS: a graph processing system, in: SSDBM, 2013, pp. 22:1-22:12.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] J. Leskovec, A. Krevl, SNAP Datasets: Stanford large network dataset collection, http://snap.stanford.edu/data, 2014.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] P. Boldi, S. Vigna, The WebGraph framework I: Compression techniques, in: WWW, 2004, pp. 595-601.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] D. Yan, Y. Huang, M. Liu, H. Chen, J. Cheng, H. Wu, C. Zhang, Graphd: Distributed vertex-centric graph processing beyond the memory limit, IEEE Trans. Parallel Distributed Syst. 29 (2018) 99-114.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, S. Muthukrishnan, One trillion edges: Graph processing at facebook-scale, Proc. VLDB Endow. 8 (2015) 1804-1815.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] D. Yan, J. Cheng, Y. Lu, W. Ng, Effective techniques for message reduction and load balancing in distributed graph computation, CoRR abs/1503.00626 (2015).</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] M. Wu, F. Yang, J. Xue, W. Xiao, Y. Miao, L. Wei, H. Lin, Y. Dai, L. Zhou, Gram: scaling graph computation to the trillions, in: SoCC, 2015, pp. 408-421.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] H. Fu, M. G. Venkata, S. Salman, N. Imam, W. Yu, Shmemgraph: Efficient and balanced graph processing using one-sided communication, in: 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2018, pp. 513-522.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>