<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Non-continuous Distributed Graph Computing System using Persistent Memory</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miaomiao Cheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiujian Chen</string-name>
          <email>jiujian@bit.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cheng Zhao</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cheng Chen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yongmin Hu</string-name>
          <email>huyongmin@bytedance.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoliang Cong</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liang Qin</string-name>
          <email>qinliang.touyi@bytedance.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hexiang Lin</string-name>
          <email>linhexiang@bytedance.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Rong-Hua Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guoren Wang</string-name>
          <email>wanggr@bit.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lei Zhang</string-name>
          <email>zhanglei.michael@bytedance.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Beijing Institute of Technology</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Douyin Vision Co., Ltd</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
        </aff>
      </contrib-group>
      <abstract>
<p>Graph computing systems play a critical role in a variety of industrial applications. This study examines ByteDance's graph computing system workload, which challenges the conventional notion of a one-shot, lightweight graph computing task and can scale to trillions of edges. The workload includes both small- and large-scale tasks, separated by a 1000-second runtime threshold. The majority of the workload consists of small-scale tasks that are submitted arbitrarily but have highly time-sensitive requirements, while large-scale tasks make up the bulk of computing resources and occur periodically. Therefore, the graph computing system must be capable of pausing running tasks and prioritizing more critical ones. In this paper, we introduce ByteGAP, a non-continuous graph computing system that leverages PMEM's unique features, such as durability, byte-addressability, memory-like access, lower latency, and high capacity. The non-continuous approach uses checkpointing mechanisms to achieve effective fault detection and recovery. ByteGAP provides two key contributions: (1) lightweight distributed checkpointing based on PMEM, and (2) efficient dual-mode PMEM management for optimizing PMEM read and write operations. Moreover, we present a comprehensive evaluation that demonstrates the system's ability to handle the challenges associated with large-scale computing tasks. The findings lay the foundation for future research in distributed graph computing systems and advocate for a non-continuous approach to graph computing.</p>
      </abstract>
      <kwd-group>
        <kwd>graph</kwd>
        <kwd>non-continuous graph processing</kwd>
        <kwd>persistent memory</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Graph computing systems have been widely used in industry [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ], catering to a variety of application scenarios. In ByteDance, thousands of graph computing tasks run daily in various scenarios such as recommendations, fraud detection, and content auditing. To better understand the characteristics of real-world graph computing workloads, we examined the tasks on ByteDance's graph computing platform.
      </p>
      <sec id="sec-1-1">
        <title>Joint Workshops at 49th International Conference on Very</title>
      </sec>
      <sec id="sec-1-2">
        <title>Large Data Bases (VLDBW’23) —Workshop on Accelerat</title>
      <p>Figure 1: Percentage of the number of tasks and resources (running time multiplied by used core numbers) occupied for different task-scales in ByteDance graph computing workloads. Tasks are separated by a runtime threshold of 1000 seconds.</p>
      <p>As Figure 1 shows, the workload is dominated by small-scale tasks submitted arbitrarily but with highly time-sensitive requirements, while large-scale tasks occur periodically and make up the bulk of computing resources. Large-scale tasks also incur significant memory consumption: for large-scale graphs, the number of edges can reach a trillion-scale, and at least 10TB of memory is required to store the pure graph data structure. These characteristics present challenges:
• Potential other-task blockage due to lack of pause capability. Large-scale graph computing tasks cannot be paused, which may result in blocking other tasks when preemptive strategies are not in place [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ].
• Lower tolerance for timeliness. Graph tasks must meet application timeliness requirements, such as daily updated search tasks. For small-scale tasks, timeliness can be ensured through a retry when encountering execution failures. However, large-scale tasks have higher retry costs, and when cluster failures occur during computation [7, 8], it may be impossible to satisfy the task's timeliness requirement.</p>
      <p>Therefore, to address these challenges, graph computing systems must support checkpoint mechanisms and implement effective non-continuous graph computing techniques to ensure high availability within the graph computing cluster. A non-continuous graph computing system supports effective fault detection and recovery strategies and allows high-priority tasks to interrupt low-priority ones. This approach offers significant advantages for large-scale graph computing systems.</p>
      <p>Large-scale graph computing workloads require a large amount of storage space, with DRAM delivering exceptional performance and SSDs providing persistent checkpoint storage and extending DRAM. In addition, PMEM boasts unique characteristics, such as durability, byte-addressability, memory-like access, lower latency, and high capacity [9, 10], rendering it a better choice for constructing checkpoint-centric graph computing systems. However, deploying large-scale graph computation on PMEM presents a challenge. To ensure data consistency, graph computation requests that write operations follow a certain order [11, 12]. However, keeping the writes to PMEM connected via the memory bus in a certain order requires a special set of CPU instructions such as CLFLUSHOPT or CLWB followed by SFENCE, which has been proven to be extremely expensive [13, 14, 15].</p>
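      <p>To make this ordering cost concrete, the following is a minimal sketch of an ordered persist on x86, assuming the CLWB/SFENCE intrinsics from immintrin.h (compiled with -mclwb); the persist_range/write_record names are illustrative stand-ins, not ByteGAP's API.</p>
      <preformat><![CDATA[
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Write back every cache line in [addr, addr + len) to PMEM, then fence.
// CLWB flushes a line without evicting it; SFENCE orders these flushes
// before any later store can become persistent.
static void persist_range(const void* addr, size_t len) {
    constexpr uintptr_t kLine = 64;
    uintptr_t p = reinterpret_cast<uintptr_t>(addr) & ~(kLine - 1);
    const uintptr_t end = reinterpret_cast<uintptr_t>(addr) + len;
    for (; p < end; p += kLine) _mm_clwb(reinterpret_cast<void*>(p));
    _mm_sfence();
}

// Ordered write: persist the payload before the validity flag, so a crash
// can never expose a flag that points at unpersisted data.
struct Record { uint64_t payload; uint64_t valid; };

void write_record(Record* r, uint64_t value) {
    r->payload = value;
    persist_range(&r->payload, sizeof r->payload);  // order point (expensive)
    r->valid = 1;
    persist_range(&r->valid, sizeof r->valid);
}
]]></preformat>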
      <p>In this paper, we propose ByteGAP, a non-continuous distributed graph computing system based on PMEM. To handle ByteDance-style graph computing workloads, we suggest a non-continuous strategy that preserves checkpoints for graph topology structures and vertex states during the iterative computing process. To increase efficiency, we further specialize the design of PMEM checkpoints. To efficiently store topology structures and vertex states from various graph computing algorithms, we design a dual-mode PMEM management strategy that supports both fixed and variable data allocation. In summary, the main contributions of this paper are as follows.
• Lightweight distributed checkpointing. ByteGAP supports efficient, lightweight distributed checkpoints based on PMEM to facilitate effective non-continuous large-scale graph computation, tailored to graph computation by considering both graph topology structure and vertex state data. We have developed a checkpoint saving mechanism using PMEM to improve system reliability and performance.
• Efficient dual-mode PMEM management. To accommodate multiple data patterns in the system, ByteGAP integrates a carefully-designed dual-mode persistent memory manager to handle and optimize PMEM read and write operations. To achieve high data access performance, the persistent memory manager employs an innovative dual-mode PMEM allocator design, integrating both pool-based and log-based PMEM allocators to support fixed-size and non-fixed-size data storage, respectively.</p>
      <p>The rest of this paper is organized as follows. §2 describes the overall system architecture. We discuss the PMEM-based checkpoint-centric system design in §3. Experimental evaluation is reported in §4, and §5 reviews the related works. At last, we conclude our work in §6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. System Architecture</title>
      <p>Figure 2: The system architecture of ByteGAP.</p>
      <p>We depict the architecture of ByteGAP in Figure 2, omitting some trivial connections between components and data for readability. The system can be seen as consisting of four parts (i.e., Application, CP-Agent, Computing Kernel and Persistent Memory Manager). Each node in the ByteGAP cluster will deploy only one process to perform graph computation while fully occupying all CPU resources.</p>
      <p>The global runtime environment is managed by the CP-Agent. We build agents to spawn, rendezvous, and monitor workers across all nodes. They play a crucial role in achieving the checkpointing for our system. Beyond the system layer, user-defined graph algorithms are written as an Application by specifying and injecting exposed interfaces. The Compute() method serves as the kernel for vertex-centric graph processing. It is defined by the users to implement the specific computation logic over each vertex along with its neighbors, following the classic 'think like a vertex' programming paradigm, based on various distributed graph algorithms. In addition, there are two optional functions that users can customize. The Combine() method is used to combine messages targeting the same destination vertex locally, reducing communication overhead across nodes. The Aggregate() method periodically aggregates global intermediate results/statistics across nodes during computation, acting as a signal for global algorithm behaviors or system controls.</p>
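      <p>A minimal sketch of what such an Application can look like, using PageRank: only the method names Compute(), Combine(), and Aggregate() come from ByteGAP's interface description; the surrounding types (VertexContext, the Mailer callback) are our own illustrative stand-ins.</p>
      <preformat><![CDATA[
#include <cstdint>
#include <vector>

struct VertexContext {
    int64_t id;
    double value;                        // mutable vertex value
    std::vector<int64_t> out_neighbors;  // read-only topology
};

class PageRankApp {
public:
    // Invoked once per active vertex per superstep ("think like a vertex").
    template <class Mailer>
    void Compute(VertexContext& v, const std::vector<double>& msgs, Mailer send) {
        double sum = 0.0;
        for (double m : msgs) sum += m;
        v.value = 0.15 + 0.85 * sum;
        if (v.out_neighbors.empty()) return;
        double share = v.value / v.out_neighbors.size();
        for (int64_t dst : v.out_neighbors) send(dst, share);
    }

    // Optional: merge two messages bound for the same destination vertex.
    double Combine(double a, double b) { return a + b; }

    // Optional: fold a per-node statistic into a global one each superstep,
    // e.g. the total rank change, to drive a convergence check.
    double Aggregate(double global_delta, double local_delta) {
        return global_delta + local_delta;
    }
};
]]></preformat>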
      <p>Computing Kernel is the core part of ByteGAP. It executes applications based on the BSP model, using user-defined methods in iterations. We further divide the worker into several components according to their functions. The Sender and the Receiver drive the tasks between computation and communication, including combining messages and constructing the Message Table in the Persistent Memory Manager. Batch Processing represents a collection of computing operations that we will discuss in §3.1. It uses multiple threads, invoking the computing methods, generating messages, and updating the values of vertices. Moreover, all threads used in the processing and communication are managed by a Thread Pool. To perform a graph algorithm computation, workers first use the Graph Loader to load the graph from the Hadoop Distributed File System (HDFS) or local file system, where users can self-define the input data format of vertices and edges in the files. Loaded data will be partitioned among distributed workers and finally saved on DRAM for the Vertex Table and persistent memory for the Edge Table respectively, with contiguous storage to obtain the performance benefits of sequential read/write. The algorithm is conducted in rounds of supersteps, where every worker processes the computation in batches and shuffles messages to others in each superstep. When the last superstep is finished, the Result Dumper will write the results back to the storage devices in a user-defined format and terminate the program.</p>
      <p>The underlying part is the Persistent Memory Manager (§3.3), which handles access to PMEM devices and persistent data management (including data storage, indexing, memory allocation, and garbage collection). Benefiting from the byte-addressability and good read/write performance of PMEM, we leverage it to store massive data in both the Edge/Message Tables and checkpoints. For the consideration of checkpointing, the Checkpoint Manager (§3.2) will generate and manage checkpoints at certain supersteps. When entering a failure recovery process, workers will use these checkpoints to restore themselves and continue distributed computing. We heavily engineered our system to be adapted for persistent memory, to achieve the best system performance with checkpointing.</p>
    </sec>
    <sec id="sec-3">
      <title>3. System Design</title>
      <p>We first introduce the computing kernel of ByteGAP in §3.1, which is the essential part of the system's execution. Then, we discuss how to achieve fast recovery with checkpoints in §3.2, and how to store and manage data in persistent memory in §3.3.</p>
      <sec id="sec-3-1">
        <title>3.1. Computing Kernel</title>
        <p>In this section, we introduce the basic data layout and computing module of ByteGAP.</p>
        <p>Figure 3: The data layout of ByteGAP (Vertex Table with ID/Value/Active/Offset columns, Edge Table, and Message Table).</p>
        <p>Page-based data layout. CSR is a widely used format for storing sparse graphs and matrices in graph systems [11, 16, 17]. We employ CSR to store graph data for the benefit of cache prefetching and efficient data access. Figure 3 depicts the data layout format of ByteGAP. All data associated with vertices is stored as a table in memory for efficient updating and fast access. Typically, each vertex takes one row in the array, storing the vertex ID, the vertex value, the active state, and an offset to the start position of its adjacency-list in the Edge Page. Each active state and vertex value is modifiable at run-time for algorithm processing. To adapt the CSR format to PMEM, we organize the adjacency lists of all vertices into a global linked page-list (i.e., the Edge Table), where each page has a fixed length to store a fixed number of edges. Then, the offsets of vertex[i] and vertex[i+1] can clearly indicate the start and end position (i.e., page number and offset on the page) of the adjacency-list of vertex[i]. Accordingly, to fetch the neighbors of one given vertex, we can iterate the pages of the Edge Table by sequentially reading on PMEM, which delivers performance (i.e., latency and throughput) comparable to DRAM.</p>
        <p>In addition, we also need to iterate over the messages belonging to each vertex during the computation. Each message includes a source vertex ID and message data. Similar to the Edge Table, we arrange all message data into a global list of linked pages on PMEM, called the Message Table.</p>
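        <p>The following sketch shows how such a neighbor scan can work; the field names follow Figure 3, but the page size, the sentinel row, and the struct layout are our assumptions, not ByteGAP's published layout.</p>
        <preformat><![CDATA[
#include <cstdint>
#include <vector>

constexpr size_t kEdgesPerPage = 512;           // fixed-length edge pages

struct VertexRow {                              // one Vertex Table row (DRAM)
    int64_t  id;
    double   value;
    bool     active;
    uint64_t offset;  // global edge index = page number * kEdgesPerPage + slot
};

struct EdgePage { int64_t dst[kEdgesPerPage]; };// node of the linked page-list (PMEM)

// Iterate the adjacency list of vertex i: the offsets of row i and row i+1
// bound its edges inside the global page-list, so the scan is sequential.
// A sentinel row at the end of the table provides the final bound.
template <class F>
void for_each_neighbor(const std::vector<VertexRow>& vertices,
                       const std::vector<EdgePage*>& edge_pages, size_t i, F f) {
    const uint64_t begin = vertices[i].offset;
    const uint64_t end   = vertices[i + 1].offset;
    for (uint64_t e = begin; e < end; ++e) {
        const EdgePage* page = edge_pages[e / kEdgesPerPage];
        f(page->dst[e % kEdgesPerPage]);
    }
}
]]></preformat>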
        <p>Message passing. While computing, a vertex can send messages to any other vertices to be processed in the next superstep. Message passing between vertices may occur across different workers. Two components of ByteGAP, the Sender and the Receiver, are used to manage this process between workers. Alongside the compute dispatcher, the Sender grabs messages from compute threads and sends them out once the message count reaches a batch size. Meanwhile, the Receiver accepts these streaming messages and saves them to the Message Table. It is expected that communication costs will overlap with computation in such pipelines.</p>
        <p>To reduce the total amount of messages processed in each superstep, we take message properties-based approaches in different scenarios. When multiple messages sent to a vertex can be combined, a user-defined Combiner is adopted in both the Sender and the Receiver. In one sending batch, messages with the same destination will be combined. And after being received, messages will also be merged into one final message while generating the Message Table. Therefore, the indices of batches of messages can be easily calculated from the offsets in the Message Table for the computing dispatchers in the next superstep.</p>
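        <p>A sketch of sender-side combining under stated assumptions: messages are doubles, the combiner is an arbitrary binary function, and the send batch holds one slot per destination; the SendBatch/Flush names are ours, not ByteGAP's.</p>
        <preformat><![CDATA[
#include <cstdint>
#include <functional>
#include <unordered_map>

class SendBatch {
public:
    SendBatch(size_t batch_size, std::function<double(double, double)> combine)
        : batch_size_(batch_size), combine_(std::move(combine)) {}

    void Add(int64_t dst, double msg) {
        auto [it, fresh] = pending_.try_emplace(dst, msg);
        if (!fresh) it->second = combine_(it->second, msg); // combine locally
        if (pending_.size() >= batch_size_) Flush();
    }

    void Flush() {
        for (auto& [dst, msg] : pending_) {
            // Here the real system would ship (dst, msg) to dst's worker,
            // where the Receiver merges it again into the Message Table.
            (void)dst; (void)msg;
        }
        pending_.clear();
    }

private:
    size_t batch_size_;
    std::function<double(double, double)> combine_;
    std::unordered_map<int64_t, double> pending_;  // one slot per destination
};
]]></preformat>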
        <p>We also use mirror vertices to reduce communication costs when each vertex sends the same message to its neighbors. A Remote Message Table is built alongside the resulting Message Table. When a vertex needs to send an identical message to multiple neighbors on a node, there will be a mirror vertex in the Remote Message Table on the target node. Message passing only occurs between the original vertex and its mirrors. Neighbors then fetch messages from the mirrors to fill the Message Table for the next superstep. For ease of programming, mirror vertices are transparent to algorithms, and these improvements are optional to reduce communication costs. We also optimize for vertices with huge degrees. They will have mirror vertices on each node, and the messages passing to them are first gathered locally before being sent out. To prevent concurrent write conflicts on such mirrors, we collect messages within each thread and combine them upon completion of each superstep. Based on the skewed distribution of graph vertices, we dynamically select the threshold for degrees, trading mirror memory usage for reduced global communication costs.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Lightweight Distributed Checkpointing</title>
        <p>To facilitate the recovery of our computing workers, we construct checkpoints at a periodic number of supersteps (which we call CP supersteps). For ease of access, the Checkpoint Manager saves checkpoints as key-value pairs in the PMEM store for fast retrieval after a process restart. The keys are generated based on both the identifier of each system component and the current system superstep number, where the values are the specific states or computing data that need to be stored inside checkpoints.</p>
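        <p>As a sketch of this key scheme: only "component identifier plus superstep number" is from the design; the exact key encoding and the put/get store interface below are our assumptions.</p>
        <preformat><![CDATA[
#include <cstdint>
#include <string>

// Hypothetical key layout: component identifier + current superstep.
std::string checkpoint_key(const std::string& component, uint32_t superstep) {
    return component + "@" + std::to_string(superstep);
}

// Usage against an abstract PMEM key-value store (put/get assumed):
//   store.put(checkpoint_key("vertex_table:worker3", 20), vertex_bytes);
//   store.put(checkpoint_key("msg_buffer_index:worker3", 20), index_bytes);
// After a restart, a worker re-issues the same keys for the superstep
// named by the leader's Last Checkpoint ID (LCI) and restores itself.
]]></preformat>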
        <p>Leader-follower checkpointing. Checkpointing occurs at the beginning of each CP superstep; then each worker enters a specific routine to construct its checkpoint individually. These checkpoints exist on the local persistent memory of each node, and we discuss their format in §3.3. Note that to ensure the integrity of the newly generated checkpoint before safely removing the previous checkpoint, all nodes should set up a global barrier before the atomic checkpoint swap. In addition, we assign one node to be the worker leader and use it to generate the system's unique Last Checkpoint ID (LCI). The LCI is a basic concept of the checkpoint mechanism, indicating what the last global CP superstep is. We only allow the leader to generate and store the LCI and broadcast it to the other followers during recovery, to avoid an inconsistent distribution, considering that one worker may fail at checkpointing while the others successfully save the new checkpoint. Therefore, at the end of the checkpointing routine, each follower sends a completion notification to the leader after its checkpoints have persisted. And only after receiving all completion messages can the leader update the LCI and acknowledge it to the followers by notifying them that the previous checkpoint is out of date. When the CP-Agents relaunch all workers, the leader will load the LCI from its persistent layer and broadcast it to all followers to restore the checkpointed data.</p>
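        <p>The commit protocol can be summarized by the sketch below; barrier() and the other helpers are no-op stand-ins for ByteGAP's real communication layer, which is not published. The single persisted LCI update acts as the atomic point of commitment.</p>
        <preformat><![CDATA[
#include <cstdint>

// No-op stand-ins for the real cluster primitives.
inline void barrier() {}                              // global superstep barrier
inline void write_local_checkpoint(uint32_t) {}
inline void wait_for_all_follower_acks() {}           // followers persisted data
inline void persist_lci(uint32_t) {}                  // one small durable write
inline void broadcast_expire_previous(uint32_t) {}

// Leader-side routine at a CP superstep.
void leader_checkpoint_round(uint32_t superstep) {
    barrier();                            // every worker reached this superstep
    write_local_checkpoint(superstep);    // leader checkpoints like any worker
    wait_for_all_follower_acks();         // each follower acks after persisting
    persist_lci(superstep);               // commit: LCI now names this round
    broadcast_expire_previous(superstep); // old checkpoint is now reclaimable
}
// If any follower fails before all acks arrive, the LCI is never advanced,
// so recovery falls back to the previous, still-intact checkpoint.
]]></preformat>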
        <p>Lightweight checkpoints. After loading the graph, topologies, i.e., edges and partition information, are involved in every computation and message sending. They will not change during the computation of many algorithms, so only one checkpoint for them is required before the first superstep of computing. And at each superstep, as described in §3.1, the computing dispatchers need the updated states and the messages received from local vertices in the last superstep, which are required to be saved in a checkpoint at each CP superstep. We adopt a lightweight checkpointing approach that is dedicated to reducing this extra overhead, based on whether the data is mutable or not and which memory it mainly uses. While edges are usually immutable and persist only once at the beginning, vertex-related data, i.e., the Vertex Table and the Message Table, take up the bulk of the workload when checkpointing.</p>
        <p>For vertices stored in DRAM, we serialize them down to the PMEM layer in parallel to make full use of PMEM bandwidth at minimum cost. The message checkpoint is thoughtfully considered in other works [18, 19], as its size can be proportional to the number of edges. However, we have already placed messages in PMEM as the Message Table before storing checkpoints. We only need to treat them as checkpoints and keep them available until the next checkpoint is generated.</p>
        <p>Specifically, we use three buffers in the Message Table to store received messages. One buffer is used for regular supersteps; it is overwritten by incoming messages iteratively at each superstep. The other two buffers work interchangeably across checkpointing rounds, which ensures that the latest checkpointed messages are always available even as we write the next checkpoint. We only need to store an index to track which buffer holds the latest messages at each CP superstep. Therefore, the overall checkpointing overhead is lightweight due to the spread of data in our system. And during recovery, we only need to restore the indices of the key-value pairs in persistent memory, as we describe below.</p>
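        <p>A minimal sketch of the triple-buffer scheme under our naming (buf[0] as the per-superstep scratch buffer, buf[1]/buf[2] alternating as checkpoint targets):</p>
        <preformat><![CDATA[
#include <array>
#include <cstdint>
#include <vector>

struct MessageBuffers {
    std::array<std::vector<uint8_t>, 3> buf; // [0] scratch, [1]/[2] checkpoints
    uint32_t latest = 1;  // which of buf[1]/buf[2] holds the last good checkpoint

    std::vector<uint8_t>& scratch() { return buf[0]; } // rewritten each superstep

    // At a CP superstep: copy into the *other* checkpoint buffer, then flip.
    // The previous checkpoint stays readable the whole time; persisting the
    // small `latest` index is what commits the new one.
    void checkpoint_messages() {
        uint32_t next = (latest == 1) ? 2u : 1u;
        buf[next] = buf[0];
        latest = next;    // recovery restores from buf[latest]
    }
};
]]></preformat>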
        <p>Checkpointing with agent. Inspired by Torch Distributed Elastic [20], which enables PyTorch to run in a fault-tolerant manner, we implement our CP-Agent to manage the distributed computing workers in ByteGAP, to assemble the essential environment for communication, and to monitor the running status of each computing process. It is commonly assumed that as cluster size increases, the chance of the entire system failing due to the failure of a single worker increases as well. Therefore, how to recover a distributed graph computing system becomes a primary concern in our design. The basic idea is to re-spawn working processes with the help of the CP-Agent, where the CP-Agent ensures that the entire program can be fully executed or terminated normally. If some processes encounter fatal errors during the computation, the CP-Agent takes charge of the recovery of process ranks, the communication environment, and worker monitoring.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Dual-Mode Persistent Memory Management</title>
        <p>We use the Persistent Memory Manager (PMM) to manage ByteGAP's persistent memory space for high-performance PMEM access. The PMM mainly supports data storage for the Edge Table, the Message Table, and checkpoints. We adopt pages of a fixed length as the minimum allocation unit for PMEM. A series of pages can form the Edge Table and the Message Table as a linked page-list. To facilitate optimal data access and allocation, we set the page size to be the OS physical page size. Since checkpointing data can be arbitrary in length, the PMM also needs to support dynamic memory allocation. Further, as we mentioned in §3.2, the PMM should organize stored elements as key-value pairs, to better categorize different checkpointed components for their on- and off-loading.</p>
        <p>Figure 4: The dual-mode PMEM allocator: (a) pool-based allocator; (b) log-based allocator.</p>
        <p>Dual-mode allocator. For the above reasons, we adopt both pool-based and log-based allocators for fixed and non-fixed data storage, respectively, to achieve efficient data management (i.e., allocation, deletion, garbage collection) on PMEM. To meet fixed-size memory allocation requirements, the pool-based allocator implements a shared memory management strategy as follows. As illustrated in Figure 4a, when a thread attempts to allocate an object, it first searches its own private object cache for a free object. If no free object is left, it fetches free objects from a global free-list to refill its local object cache. Only if there are not enough free objects in the global free-list does the thread allocate a new chunk from the PMEM device, using this new chunk to generate free objects to refill the object cache. When a thread wants to deallocate an object, it first sets the tombstone bit of the deleted object and then adds the object to its private object cache. If the thread finds too many free objects (e.g., twice the bulk load size) in the object cache, it moves some free objects from its private free-list to the global free-list. Such a strategy provides the pool-based allocator with excellent data locality and high performance for data allocation.</p>
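        <p>The following is a minimal shape of that pool-based path, assuming illustrative object/chunk sizes and using the heap as a stand-in for a mapped PMEM chunk; none of the identifiers are from ByteGAP's code.</p>
        <preformat><![CDATA[
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

struct Pool {
    static constexpr size_t kObjSize = 64, kObjsPerChunk = 1024, kBulk = 32;

    std::vector<void*> global_free;   // shared free-list
    std::mutex mu;

    struct ThreadCache { std::vector<void*> free; }; // private per thread

    void* Allocate(ThreadCache& tc) {
        if (tc.free.empty()) Refill(tc);          // miss: go to the global list
        void* obj = tc.free.back();
        tc.free.pop_back();
        return obj;
    }

    void Deallocate(ThreadCache& tc, void* obj) {
        // The real system first sets a tombstone bit inside *obj on PMEM.
        tc.free.push_back(obj);
        if (tc.free.size() > 2 * kBulk) {         // cache overflow: give back
            std::lock_guard<std::mutex> g(mu);
            while (tc.free.size() > kBulk) {
                global_free.push_back(tc.free.back());
                tc.free.pop_back();
            }
        }
    }

    void Refill(ThreadCache& tc) {
        std::lock_guard<std::mutex> g(mu);
        if (global_free.size() < kBulk) NewChunk(); // carve a fresh chunk
        for (size_t i = 0; i < kBulk && !global_free.empty(); ++i) {
            tc.free.push_back(global_free.back());
            global_free.pop_back();
        }
    }

    void NewChunk() {
        // Stand-in for mapping a new chunk from the PMEM device.
        char* chunk = static_cast<char*>(::operator new(kObjSize * kObjsPerChunk));
        for (size_t i = 0; i < kObjsPerChunk; ++i)
            global_free.push_back(chunk + i * kObjSize);
    }
};
]]></preformat>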
        <p>For the log-based allocator depicted in Figure 4b, it manages the memory space in chunks, where each chunk stores variable-length records in a log-structured fashion and new records are appended at the tail. Deleted records gradually fragment a chunk. When the allocator finds a highly fragmented chunk, it first scans the chunk to locate all of its valid records, and then copies those valid records into a new chunk to seek a tighter data arrangement. Next, the allocator maps the memory accesses of the migrated records to their new memory addresses, and finally reclaims the memory space of the old chunk.</p>
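        <p>A sketch of that compaction step, under simplified Record/Chunk stand-ins; remap() models updating whatever indirection callers use to reach a record, which the paper does not detail.</p>
        <preformat><![CDATA[
#include <cstddef>
#include <cstdint>
#include <vector>

struct Record { uint64_t id; bool tombstone; std::vector<uint8_t> bytes; };
struct Chunk  { std::vector<Record> records; };

// Copy live records out of a fragmented chunk, then reclaim it.
void compact(Chunk& frag, Chunk& fresh,
             void (*remap)(uint64_t id, const Chunk* c, size_t slot)) {
    for (Record& r : frag.records) {
        if (r.tombstone) continue;          // dead: dropped by compaction
        fresh.records.push_back(r);         // live: copied tightly together
        remap(r.id, &fresh, fresh.records.size() - 1);
    }
    frag.records.clear();                   // old chunk space reclaimed
}
]]></preformat>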
        <p>Persistent data lifetime. For persistent data stored in the PMM belonging to different system components, we should use different types to manage their lifetime at runtime. For example, when a new checkpoint is generated, the old checkpoint does not need to be maintained. Yet deleting an outdated checkpoint is not an atomic operation: a failure may occur during the asynchronous deletion, and the CP-Agent will restart all workers after that. Then, as the PMM recovers by scanning the PMEM devices, we can directly drop the useless records and retain the useful ones based on their lifetime type. Therefore, we classify all data managed by the PMM into three lifetime types: Keep, Delete, and Exclusively Keep.</p>
        <p>The PMM supports data recovery when workers perform unexpected exits. Note that we reduce the recovery overhead by using PMEM runtime storage. Unlike other systems [
          <xref ref-type="bibr" rid="ref1 ref21 ref22 ref19">1, 21, 22, 19</xref>
          ], which recover the Message Table via recomputation or by reloading from a disk log, we directly store and fetch the Message Table in PMEM at runtime without any further operations.</p>
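        <p>Under our reading, the three lifetime types can be applied during the recovery scan roughly as follows; in particular, the rule that Exclusively Keep ties a record to the committed LCI round is our interpretation, not stated in the text.</p>
        <preformat><![CDATA[
#include <cstdint>

enum class Lifetime : uint8_t {
    kKeep,            // always restored on recovery (e.g., graph topology)
    kDelete,          // always dropped (e.g., half-deleted, outdated records)
    kExclusivelyKeep, // kept only for the committed checkpoint round
};

// Decision applied to each record found while scanning the PMEM devices
// after the CP-Agent restarts the workers.
bool keep_on_recovery(Lifetime t, uint32_t record_superstep, uint32_t lci) {
    switch (t) {
        case Lifetime::kKeep:            return true;
        case Lifetime::kDelete:          return false;
        case Lifetime::kExclusivelyKeep: return record_superstep == lci;
    }
    return false;
}
]]></preformat>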
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>This section reports our evaluation of ByteGAP, covering the overhead and recovery performance of its non-continuous checkpointing, the scalability of the system, and comparisons with out-of-core graph computing systems.</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>Testbed. We evaluated ByteGAP on a cluster of up to 16 machines, each equipped with two Intel Xeon Platinum 8260 CPUs (48 cores in total) and 128GB of DRAM. Each node also contains 320GB of SSD and 512GB of Optane DC PMM, which is configured in Memory mode or App Direct mode depending on the situation in our evaluation.</p>
        <p>Datasets. We adopt four real-world datasets for our evaluation. Table 1 shows the statistics of these four graphs: Twitter and Friendster from SNAP [23], and UK-2007 and UK-union from LAW [24]. Twitter and Friendster belong to the social networking domain, while UK-2007 and UK-union are both Internet graphs. The number of edges in each graph reaches the billion scale, and the degrees of the Internet graphs are larger than those of the social graphs. In addition, we generated a weighted graph from each dataset by randomly assigning an integer from 0 to 100 to each edge.</p>
        <p>Algorithms. We evaluated our system with four representative algorithms: PageRank, Connected Components (CC), Breadth-First Search (BFS) and Single-Source Shortest Path (SSSP). We ran these algorithms against other out-of-core systems for comparison. And since PageRank is implemented in all systems, we use it as a core test in several experiments.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Checkpointing Evaluation</title>
        <p>We examined the pros and cons of checkpointing in ByteGAP. To evaluate the elapsed time of checkpointing and recovery, we ran PageRank on the Twitter dataset and configured the system to write checkpoints at every 10th superstep. In the superstep after saving a checkpoint, one of the computing kernels kills itself to simulate a worker stopping. The PageRank algorithm ensures that each superstep has the same computation workload and generates the same number of messages. After the worker stops, the CP-Agent will relaunch and guide the system to begin recovery. There are four important phases in failure and recovery. We represent the time metrics of this process as follows:
• T_norm: the time of each superstep without saving checkpoints, used to indicate the elapsed time of normal execution for comparison with the checkpointing overhead. In our settings, this is the average time of ten supersteps.
• T_0 and T_cp: the time to write all needed checkpoints when the worker executes normally. We use T_0 to denote the time of the initial checkpointing before the first superstep, and T_cp as the time of saving checkpoints at the latest CP superstep.
• T_rec: the time of recovery from checkpoints. As our agents also kill the surviving workers, this covers the time for all relaunched workers to restore from their checkpoints.</p>
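        <p>In LaTeX form, the metrics can be summarized as follows (a sketch: the symbols T_norm, T_0, T_cp, and T_rec stand in for the paper's notation, and the T_rec identity is the subtraction used for the GraphX measurement below):</p>
        <preformat><![CDATA[
\begin{align*}
T_{\mathrm{norm}} &= \frac{1}{10}\sum_{k=1}^{10} t_k
  && \text{average superstep time without checkpointing} \\
T_0 &\;\text{: initial checkpoint (including topology) write time} \\
T_{\mathrm{cp}} &\;\text{: checkpoint write time at the latest CP superstep} \\
T_{\mathrm{rec}} &= T_{\text{failed-and-recovered run}} - T_{\text{normal run}}
  && \text{recovery time}
\end{align*}
]]></preformat>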
        <p>To fully evaluate performance, we recorded these time metrics for ByteGAP on 2 to 10 machines, as shown in Figures 5a-5c. There is one CP-Agent on each machine to keep the PMM and the Computing Kernel healthy, and each worker only saves checkpoints to, and loads them back from, its local storage. Figure 5a depicts the time metrics for the first two phases described above. The time T_0 of writing the initial checkpoint occurs only once in the whole running time. It does not scale as well as the other metrics because it includes the additional cost of saving the topology for each graph. However, the time of saving checkpoints at a CP superstep is always much smaller than normal execution. As the number of machines increases, the average execution time required for a superstep decreases proportionally. And T_cp also decreases, staying at about one-third of T_norm.</p>
        <p>We also ran ByteGAP in a testbed where the PMM uses a normal path on SSD as the underlying storage. Although naturally supported in our system, such a use case lacks the specialized design for PMEM that improves I/O efficiency. Figure 5b shows that checkpoints were written faster on SSDs than on PMEM. This is because the operating system has optimizations for SSD writes, including caching, which PMEM cannot currently use. Figure 5c illustrates the recovery time on different numbers of machines. Each of the reported times decreases exponentially as the number of machines grows. With a few machines, T_rec when running on PMEM is much lower than when running on SSDs. However, when more machines are involved, both perform at close metrics. Despite hardware limitations, ByteGAP benefits from PMEM features and does not lose performance when using SSDs on multiple machines.</p>
        <p>We also evaluated ByteGAP versus GraphX (Spark 3.0) on 2 to 10 PMEM-equipped servers to compare checkpoint performance. GraphX is a widely used graph processing system. For GraphX, we killed processes during computation to simulate task failure and recovery. T_rec is obtained by subtracting the time of normally running tasks from that of failed and recovered tasks. T_norm is likewise represented by the average time of the first 10 iterations. Since GraphX's checkpointing time T_cp involves synchronous copies and is difficult to measure, this experiment did not statistically analyze T_cp but mainly focused on measuring T_rec. On 10 machines, we found that the average time required for a single iteration of GraphX is 27.1 seconds, while T_rec of GraphX takes 23.5 seconds. The average time required for a single iteration of ByteGAP is only 4.3 seconds, while T_rec takes only 2.7 seconds. ByteGAP achieved much shorter T_norm and T_rec than GraphX. The T_rec comparison from 2 to 10 machines is shown in Figure 5d; ByteGAP is much faster than GraphX for all numbers of machines. Because GraphX uses Spark's checkpoint function, only a portion of the computing resources is used right after recovery, so when there are many computing nodes, the entire resources cannot be used immediately after recovery, which can lead to resource waste.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Scalability</title>
        <p>We examine the scalability of ByteGAP in this section to evaluate the relationship between execution time and cluster size. Intuitively, processing performance can be improved by increasing the number of machines. Since persistent memory can be configured in Memory mode or App Direct mode, we evaluated scalability by the running time in both configurations with up to 10 machines. Specifically, we adopted two datasets, Twitter and UK-union, and used 2, 4, 6, 8, and 10 machines to run the PageRank and CC algorithms in the two modes respectively. Figure 6 reports the results of our evaluation. As depicted, Memory mode and App Direct mode take different amounts of time. In almost all experiments, ByteGAP using App Direct mode delivered a smaller execution time than Memory mode with the same algorithm and the same number of machines. For the case of computing connected components on the Twitter dataset with 8 machines, it takes 25.19s in App Direct mode and 55.29s in Memory mode. This indicates two reasons: (1) the data storage we designed is more suitable for I/O with direct access, and (2) using Memory mode may cause memory swapping too frequently, reducing ByteGAP's performance.</p>
        <p>Our experiments verify that increasing the number of machines can reduce ByteGAP's execution time. The system benefits from more CPU resources and higher parallelism in both configurations as the number of machines increases. For example, when running PageRank on the Twitter dataset, Memory mode takes 279.64s on 2 machines and only 55.29s on 8 machines, and App Direct mode takes 204.8s on 2 machines and only 39.77s on 8 machines, showing that more machines need less time. This means that we can achieve better performance by adding more machines to improve parallelism. However, adding more machines to the cluster may introduce new overhead, such as system consistency guarantees, network communication overhead across machines, and hardware access overhead, all of which tend to degrade system performance. Experiments show that the performance of App Direct mode is worse than Memory mode at 10 machines when computing PageRank on Twitter and CC on UK-union, which also indicates that performance degradation occurs earlier in App Direct mode than in Memory mode. For example, running PageRank on the Twitter dataset in App Direct mode takes 69.59s with 6 machines and 39.77s with 8 machines, but 48.38s with 10 machines. However, this did not happen in Memory mode. As a result, I/O performance remains a bottleneck in the storage layer of the distributed system when computing resources are sufficient.</p>
        <p>Figure 8: ByteGAP vs. GraphChi on (a) PageRank and (b) CC.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Comparison with Out-of-Core Systems</title>
        <p>In this subsection, we compare the computing performance of ByteGAP with other out-of-core systems, because they have similar use cases with external storage. GraphChi [17] and GraphD [25] are well-known disk-based systems running in a single machine and in a distributed environment, respectively. We mounted the Optane PMM as an independent file system path in the OS and left all DRAM as the main memory, so that all systems are treated the same and configured consistently. While the other systems accept the edge-list file format, input graphs for GraphChi are preprocessed and stored directly on PMEM. We compare the execution time of ByteGAP with the above systems by running PageRank, CC, BFS, and SSSP on all datasets individually.</p>
        <p>We first report the execution time compared to GraphChi, as shown in Figure 8. There is a different emphasis in the design of single-machine and distributed systems, such as the shared-memory or shared-nothing architecture and the synchronization mechanism between supersteps. We sequentially ran ByteGAP from a single machine to multiple machines for comparison. To better measure the computing performance of the systems, we recorded the running time around the iterations of the graph algorithms only for each system. We start with the time of computing PageRank and connected components on one machine with ByteGAP and GraphChi, as depicted in Figure 8a and Figure 8b. Consequently, ByteGAP outperforms GraphChi on all datasets. The Parallel Sliding Windows approach for accessing disks in GraphChi ...</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Related Works</title>
      <p>
        Recently, many graph computing systems have been proposed following the vertex-centric programming model, which allows each vertex to access the messages received from the last superstep and to send new messages to its neighbors after computing the updated value. Following Google's Pregel system [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], many Pregel-like open-source systems have been proposed, including Giraph [26], Pregel+ [27], GraphX [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and so on. Like Pregel, these systems maintain the entire graph in main memory during the computing procedure. GraM [28] is a new system architecture that exploits the benefits of multi-core and RDMA for communication.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Malewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Austern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J. C.</given-names>
            <surname>Bik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Dehnert</surname>
          </string-name>
          , I. Horn,
          <string-name>
            <given-names>N.</given-names>
            <surname>Leiser</surname>
          </string-name>
          , G. Czajkowski,
          <article-title>Pregel: a system for large-scale graph processing</article-title>
          ,
          <source>in: SIGMOD</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Crankshaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Stoica</surname>
          </string-name>
          ,
          <article-title>Graphx: Graph processing in a distributed dataflow framework</article-title>
          ,
          <source>in: OSDI</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>599</fpage>
          -
          <lpage>613</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Jiang,</surname>
          </string-name>
          <article-title>GRAPE: parallelizing sequential graph computations</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>10</volume>
          (
          <year>2017</year>
          )
          <fpage>1889</fpage>
          -
          <lpage>1892</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Vavilapalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Murthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Douglas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Konar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seth</surname>
          </string-name>
          , et al.,
          <article-title>Apache hadoop yarn: Yet another resource negotiator</article-title>
          ,
          <source>in: Proceedings of the 4th annual Symposium on Cloud Computing</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Rejiba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chamanara</surname>
          </string-name>
          ,
          <article-title>Custom scheduling in kubernetes: A survey on common problems and solution approaches</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2022</year>
          ). URL: https://doi.org/10.1145/3544788. doi:10.1145/3544788.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sahni</surname>
          </string-name>
          ,
          <article-title>Preemptive scheduling of uniform processor systems</article-title>
          ,
          <source>Journal of the ACM (JACM) 25</source>
          (
          <year>1978</year>
          )
          <fpage>92</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Eisenman, K. K. Matam, S. Ingram, D. Mudigere, R. Krishnamoorthi, K. Nair, M. Smelyanskiy, M. Annavaram, Check-n-run: a checkpointing system for training deep learning recommendation models, in: 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), 2022, pp. 929-943.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] T. D. Chandra, S. Toueg, Unreliable failure detectors for reliable distributed systems, Journal of the ACM (JACM) 43 (1996) 225-267.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] L. Benson, L. Papke, T. Rabl, PerMA-bench: Benchmarking persistent memory access, Proc. VLDB Endow. 15 (2022) 2463-2476. URL: https://www.vldb.org/pvldb/vol15/p2463-benson.pdf.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] S. Gugnani, A. Kashyap, X. Lu, Understanding the idiosyncrasies of real persistent memory, Proc. VLDB Endow. 14 (2020) 626-639. URL: http://www.vldb.org/pvldb/vol14/p626-gugnani.pdf. doi:10.14778/3436905.3436921.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] X. Zhu, W. Chen, W. Zheng, X. Ma, Gemini: A computation-centric distributed graph processing system, in: OSDI, 2016, pp. 301-316.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J. Shun, G. E. Blelloch, Ligra: a lightweight graph processing framework for shared memory, in: A. Nicolau, X. Shen, S. P. Amarasinghe, R. W. Vuduc (Eds.), ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, Shenzhen, China, February 23-27, 2013, ACM, 2013, pp. 135-146. URL: https://doi.org/10.1145/2442516.2442530. doi:10.1145/2442516.2442530.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] C. Chen, J. Yang, M. Lu, T. Wang, Z. Zheng, Y. Chen, W. Dai, B. He, W.-F. Wong, G. Wu, Y. Zhao, A. Rudoff, Optimizing in-memory database engine for ai-powered on-line decision augmentation using persistent memory, Proc. VLDB Endow. 14 (2021) 799-812. URL: https://doi.org/10.14778/3446095.3446102. doi:10.14778/3446095.3446102.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] L. Benson, H. Makait, T. Rabl, Viper: An efficient hybrid pmem-dram key-value store, Proc. VLDB Endow. 14 (2021) 1544-1556. URL: http://www.vldb.org/pvldb/vol14/p1544-benson.pdf. doi:10.14778/3461535.3461543.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Y. Chen, Y. Lu, F. Yang, Q. Wang, Y. Wang, J. Shu, Flatstore: An efficient log-structured key-value storage engine for persistent memory, in: J. R. Larus, L. Ceze, K. Strauss (Eds.), ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16-20, 2020, ACM, 2020, pp. 1077-1091. URL: https://doi.org/10.1145/3373376.3378515. doi:10.1145/3373376.3378515.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. M. Hellerstein, Graphlab: A new framework for parallel machine learning, CoRR abs/1408.2041 (2014).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. Kyrola, G. E. Blelloch, C. Guestrin, Graphchi: Large-scale graph computation on just a PC, in: OSDI, 2012, pp. 31-46.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Y. Shen, G. Chen, H. V. Jagadish, W. Lu, B. C. Ooi, B. M. Tudor, Fast failure recovery in distributed graph processing systems, Proc. VLDB Endow. 8 (2014) 437-448.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] D. Yan, J. Cheng, H. Chen, C. Long, P. V. Bangalore, Lightweight fault tolerance in pregel-like systems, in: ICPP, 2019, pp. 69:1-69:10.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Torch distributed elastic, 2021. https://pytorch.org/docs/stable/distributed.elastic.html.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin, Powergraph: Distributed graph-parallel computation on natural graphs, in: OSDI, 2012, pp. 17-30.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] S. Salihoglu, J. Widom, GPS: a graph processing system, in: SSDBM, 2013, pp. 22:1-22:12.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] J. Leskovec, A. Krevl, SNAP Datasets: Stanford large network dataset collection, http://snap.stanford.edu/data, 2014.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] P. Boldi, S. Vigna, The WebGraph framework I: Compression techniques, in: WWW, 2004, pp. 595-601.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] D. Yan, Y. Huang, M. Liu, H. Chen, J. Cheng, H. Wu, C. Zhang, Graphd: Distributed vertex-centric graph processing beyond the memory limit, IEEE Trans. Parallel Distributed Syst. 29 (2018) 99-114.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, S. Muthukrishnan, One trillion edges: Graph processing at facebook-scale, Proc. VLDB Endow. 8 (2015) 1804-1815.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] D. Yan, J. Cheng, Y. Lu, W. Ng, Effective techniques for message reduction and load balancing in distributed graph computation, CoRR abs/1503.00626 (2015).</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] M. Wu, F. Yang, J. Xue, W. Xiao, Y. Miao, L. Wei, H. Lin, Y. Dai, L. Zhou, Gram: scaling graph computation to the trillions, in: SoCC, 2015, pp. 408-421.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] H. Fu, M. G. Venkata, S. Salman, N. Imam, W. Yu, Shmemgraph: Efficient and balanced graph processing using one-sided communication, in: 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2018, pp. 513-522.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>