Workload Cost Optimization Using Dynamic Replication in Decentralized Systems

Ryoga Yoshida, Osaka University

Yamadaoka 565-0871 Suita Osaka Japan

Chuan Xiao, Osaka University

Yamadaoka 565-0871 Suita Osaka Japan

Makoto Onizuka, Osaka University

Yamadaoka 565-0871 Suita Osaka Japan

Keywords: dynamic replication, decentralized systems, transaction management

Data replication plays a crucial role in decentralized systems by enhancing durability and availability. The ADR algorithm is a dynamic replication method that optimizes communication costs by adaptively adjusting the number of replicas. However, it overlooks workload costs, which are critical in real-world applications, leading to suboptimal performance, especially in parallel processing environments. To address this limitation, we propose an enhanced ADR algorithm that incorporates both communication and computational costs. Our method refines the cost model by considering the maximum-cost execution path in update transactions, ensuring a more accurate workload estimation. Additionally, we introduce an improved expansion-contraction test that efficiently optimizes replication placement. Experimental evaluations across various network topologies demonstrate that the proposed method achieves up to 12% higher throughput than the existing ADR algorithm, particularly in read-heavy environments. These results indicate that our approach provides a more balanced and efficient replication strategy, adapting to diverse workload patterns in decentralized systems.

Introduction

Data replication is a fundamental technique in decentralized systems, where data is replicated and stored across multiple processors. When writing data, update transactions are synchronized across the replicas on multiple processors to ensure durability. When reading data, the data can be retrieved from any processor holding the latest version, thereby maintaining consistency.

The number of processors on which the latest data is replicated (which we call replication processors) is a critical factor in data replication and can significantly impact system performance. For example, in systems with read-heavy workloads, a small number of replication processors leads to frequent data retrieval from remote replication processors, degrading performance. In contrast, in systems with update-heavy workloads, a large number of replication processors increases the update load and degrades performance. Thus, for read-heavy workloads, the number of replication processors should be large to reduce the number of data retrievals from remote replication processors, while for update-heavy workloads, it should be small to decrease the update load.

The optimal number of replication processors typically depends on the frequency of read and update transactions on each processor. In many decentralized systems, system designers must define the number of replication processors statically during the design phase and then adjust it manually during the production phase [1]. However, this approach is suboptimal in environments where the mix of read and update transactions fluctuates frequently, and it is inefficient because of the manual effort required from system designers. To overcome this, dynamic replication techniques are promising because they adaptively adjust the number of replication processors.

Contact: yoshida.ryoga@ist.osaka-u.ac.jp (R. Yoshida); chuanx@ist.osaka-u.ac.jp (C. Xiao); onizuka@ist.osaka-u.ac.jp (M. Onizuka)
Web: https://sites.google.com/site/chuanxiao1983 (C. Xiao); http://www-bigdata.ist.osaka-u.ac.jp/professor/onizuka/onizuka_en.html (M. Onizuka)

The ADR algorithm [1] is one of these dynamic replication techniques. It adaptively changes the number of replication processors according to the read/update transactions by periodically running expansion and contraction tests. However, it has two issues: (1) it focuses solely on optimizing communication cost and does not consider workload cost, which is more critical in real-world applications, and (2) it prioritizes minimizing communication cost, which can lead to longer overall transaction execution times.

To overcome the above issues, we propose an enhanced ADR algorithm. Specifically, in addition to communication costs, we consider processor computational costs, allowing for a more accurate estimation of overall workload cost. Furthermore, since the execution time of update transactions in parallel environments is determined by the heaviest computational path, we redefine update transaction cost by focusing only on the maximum cost path, rather than summing up the weights of all paths.

Preliminaries

In decentralized systems, minimizing the workload cost across the entire system is a critical factor. We focus on applications that perform system-wide operations with the objective of reducing overall workload cost. Various dynamic replication algorithms have been proposed [1,2,3,4,5], among which the ADR algorithm [1] is designed to optimize overall communication cost by adaptively modifying the replication scheme 𝑅. Given its objective, it is considered the most relevant algorithm for achieving the goal of this study.

Replication Scheme

A replication scheme 𝑅 represents the set of processors that hold the latest replicas and forms a variable-sized "amoeba" that shifts toward the center of the network of read/write (read/update) requests. 𝑅 is created for each data object. When the number of read requests increases, the ADR algorithm expands 𝑅 to reduce the communication cost by responding to read requests from a local processor or nearby processors. In contrast, when the number of write requests increases, the ADR algorithm shrinks 𝑅 to reduce the overhead of updating replicas in 𝑅. Hereafter, the processors included in 𝑅 are referred to as 𝑅 processors.

For data reading, if the processor where a read request occurs belongs to 𝑅, the processor reads the local replica. If the processor does not belong to 𝑅, the processor sequentially sends read requests to its neighboring processors. Once the request reaches an 𝑅 processor, it returns the replica to the requesting processor. For data writing, the replicas in all 𝑅 processors are updated synchronously by repeatedly sending the data to neighboring processors.

As an example, consider the communication network shown in Figure 1. The replication scheme 𝑅 consists of processors 𝑝4 and 𝑝5, depicted in green. When reading data at processor 𝑝1, the nearest 𝑅 processor is processor 𝑝4, from which the data is fetched. When writing data at processor 𝑝1, the update is sequentially sent to 𝑝4 and 𝑝5 via 𝑝2, and the replicas are updated on those processors.
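The read and write routing described above can be sketched in a few lines of Python. This is only an illustration: the adjacency list `NETWORK` is a hypothetical topology chosen to be consistent with the example (𝑝1 reaches 𝑝4 and 𝑝5 via 𝑝2), since the exact shape of Figure 1 is not reproduced in the text, and the helper names are made up for this sketch.

```python
from collections import deque

# Hypothetical topology consistent with the example: p1 - p2 - p4 - p5,
# with p3 attached to p2. The real Figure 1 may differ.
NETWORK = {
    "p1": ["p2"],
    "p2": ["p1", "p3", "p4"],
    "p3": ["p2"],
    "p4": ["p2", "p5"],
    "p5": ["p4"],
}
R = {"p4", "p5"}  # replication scheme from the example


def hop_distances(network, source):
    """Breadth-first search: hop count from source to every reachable processor."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in network[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def nearest_replica(network, scheme, requester):
    """R processor that serves a read from requester (ties broken by name)."""
    dist = hop_distances(network, requester)
    target = min(scheme, key=lambda p: (dist[p], p))
    return target, dist[target]


def farthest_replica(network, scheme, requester):
    """Last R processor reached when requester propagates a write."""
    dist = hop_distances(network, requester)
    target = max(scheme, key=lambda p: (dist[p], p))
    return target, dist[target]


print(nearest_replica(NETWORK, R, "p1"))   # ('p4', 2)
print(farthest_replica(NETWORK, R, "p1"))  # ('p5', 3)
```

Under this assumed topology, a read at 𝑝1 is served by 𝑝4 after two hops, and a write at 𝑝1 has reached every replica once it arrives at 𝑝5 after three hops.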

Under the ADR algorithm, the replication scheme 𝑅 always forms a single connected set of processors. In addition, 𝑅 is created in various object units, such as a tuple, block, or text file. It is guaranteed that when the read-write pattern at each processor (the number of reads and writes issued by each processor) is regular, the replication scheme converges to the optimal configuration, regardless of the initial scheme [1].

Expansion and Contraction

𝑅 is adjusted through expansion and contraction¹ at fixed intervals. Expansion occurs in systems with read-intensive workloads, increasing the size of 𝑅 (i.e., adding more processors to 𝑅) to reduce the communication cost between the requesting processor and the 𝑅 processors. In contrast, contraction occurs in systems with write-intensive workloads, decreasing the size of 𝑅 (i.e., removing processors from 𝑅) to reduce the communication cost between the 𝑅 processors. Whether to expand or contract 𝑅 is determined by executing the expansion and contraction tests, respectively.

Issues of the ADR Algorithm

The ADR algorithm focuses solely on optimizing communication cost and does not consider workload cost, which is more critical in real-world applications. As a result, it fails to consider disparities in processing costs among processors or disparities in the execution times of transactions between read and write operations.

Additionally, in parallel processing environments, there are cases where communication time increases, but transaction execution time decreases. In such situations, the ADR algorithm prioritizes minimizing communication cost, which can inadvertently prolong overall transaction execution time.

Proposed Method

This section introduces an improved version of the ADR algorithm. As noted earlier, the ADR algorithm optimizes only communication cost, neglecting workload cost, which is more crucial in real-world applications. To overcome this issue, the proposed method modifies the cost function of the ADR algorithm and redefines the optimization equation for a more realistic workload representation.

Specifically, in addition to communication costs, the proposed method considers processor computational costs, allowing for a more accurate estimation of overall workload cost. Furthermore, since update transaction processing time in parallel execution environments is determined by the heaviest computational path, the proposed method redefines update transaction cost by focusing only on the maximum cost path, rather than summing up the weights of all paths.
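The difference between summing all path weights and taking only the heaviest path can be made concrete with a small sketch. The path weights below are hypothetical values chosen for illustration, and the two function names are not taken from the paper.

```python
def update_cost_sum(path_weights):
    """Summed view: every propagation path contributes its full weight."""
    return sum(path_weights)


def update_cost_max(path_weights):
    """Parallel view: the update completes when the heaviest path completes."""
    return max(path_weights, default=0)


# Hypothetical star: the center sends an update to three leaf replicas in
# parallel, each over a path of weight 1.
balanced_star = [1, 1, 1]
# Hypothetical uneven case: one of the three paths is much heavier.
uneven_star = [1, 1, 4]

print(update_cost_sum(balanced_star), update_cost_max(balanced_star))  # 3 1
print(update_cost_sum(uneven_star), update_cost_max(uneven_star))      # 6 4
```

In the balanced case the summed cost is three times the maximum-path cost, which is the kind of overestimate that would push a communication-only model toward a smaller 𝑅 than parallel execution actually warrants.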

Optimization Formula

We revise the ADR algorithm to optimize the workload cost across the entire system. To optimize the workload cost rather than the communication cost, we extend the objective to account not only for the number of communications but also for the associated read/update costs.

The workload cost and optimization formulation are defined as follows:

\[
C_{\mathrm{workload}}(R) := \sum_{v \in V} \bigl( \#U(v, R) \times C_u(v, R) + \#F(v, R) \times C_r(v, R) \bigr)
\]

\[
\operatorname*{argmin}_{R} \; C_{\mathrm{workload}}(R)
\]

where 𝑣 denotes a processor in the network, 𝑉 denotes the set of all processors, 𝑅 denotes the replication scheme, #U(𝑣, 𝑅) denotes the number of update transactions, C_u(𝑣, 𝑅) denotes the cost of an update transaction, #F(𝑣, 𝑅) denotes the number of fetch transactions, and C_r(𝑣, 𝑅) denotes the cost of a read transaction. The goal of the optimization formula is to find the replication scheme 𝑅 that minimizes the workload cost. In practice, 𝑅 is gradually adjusted to progressively reduce the workload cost as much as possible.

In the proposed method, #F(𝑣, 𝑅), C_r(𝑣, 𝑅), and C_u(𝑣, 𝑅) are defined as follows:

\[
\begin{aligned}
C_u(v, R) &:= L_u(v, R) \\
\#F(v, R) &:= \#R(v, R)\,\beta(v, R) \\
C_r(v, R) &:= L_r(v, R)
\end{aligned}
\]

where L_u(𝑣, 𝑅) is the distance to the farthest 𝑅 processor from processor 𝑣, 𝛽(𝑣, 𝑅) is the cache miss rate, and L_r(𝑣, 𝑅) is the distance from the nearest 𝑅 processor to processor 𝑣.

When a cache hit occurs, no fetch operation is triggered, allowing local reads with zero cost. Therefore, the workload cost at a processor considers only the read costs incurred by fetch operations, which is determined by multiplying the number of read transactions #R(𝑣, 𝑅) by the cache miss rate 𝛽(𝑣, 𝑅), resulting in #F(𝑣, 𝑅).
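As a minimal sketch of how these definitions combine, the following Python function evaluates the workload cost for a given replication scheme, assuming unit hop distances. The three-processor line network, the transaction counts, and the miss rates are hypothetical inputs, and `workload_cost` is a name introduced only for this example.

```python
from collections import deque


def hop_distances(network, source):
    """Breadth-first search: hop count from source to every reachable processor."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in network[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def workload_cost(network, scheme, updates, reads, miss_rate):
    """Sum over v of #U * L_u + #F * L_r, with #F = #R * beta."""
    total = 0.0
    for v in network:
        dist = hop_distances(network, v)
        l_u = max(dist[p] for p in scheme)  # farthest R processor from v
        l_r = min(dist[p] for p in scheme)  # nearest R processor to v
        fetches = reads[v] * miss_rate[v]   # only cache misses trigger a fetch
        total += updates[v] * l_u + fetches * l_r
    return total


# Hypothetical line network p1 - p2 - p3 with made-up statistics.
NETWORK = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2"]}
UPDATES = {"p1": 1, "p2": 1, "p3": 1}
READS = {"p1": 10, "p2": 10, "p3": 10}
MISS_RATE = {"p1": 0.5, "p2": 0.5, "p3": 0.5}

print(workload_cost(NETWORK, {"p2"}, UPDATES, READS, MISS_RATE))              # 12.0
print(workload_cost(NETWORK, {"p1", "p2", "p3"}, UPDATES, READS, MISS_RATE))  # 5.0
```

With these read-heavy inputs, replicating on all three processors lowers the cost from 12.0 to 5.0, because every fetch becomes local while the update cost grows only modestly.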

Expansion-Contraction Test

The workload cost is optimized through an expansion-contraction test, which is executed after every 𝑘 successfully completed transactions. In the expansion-contraction test, for each expansion-contraction pattern, the system determines whether to expand expandable processors or contract contractible processors by calculating the differential workload cost.

As in the ADR algorithm, each expansion-contraction test prohibits expanding or contracting by more than one hop and forming a disconnected 𝑅. These restrictions are imposed because of computational complexity concerns and because larger changes could cause excessive fluctuation in #F(𝑣) and #U(𝑣) before and after expansion-contraction.
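One way to trigger the test after every 𝑘 successful transactions is a simple counter that also accumulates the per-processor read and update counts the test needs. The class below is a hypothetical helper, not part of the published system; how the statistics are reset between tests is likewise an assumption of this sketch.

```python
from collections import Counter


class WindowStats:
    """Hypothetical per-window statistics for the expansion-contraction test."""

    def __init__(self, k):
        self.k = k                 # run the test after every k successes
        self.reads = Counter()     # successful reads per processor
        self.updates = Counter()   # successful updates per processor
        self.completed = 0

    def record(self, processor, kind):
        """Record one successful transaction; return True when a test is due."""
        if kind == "read":
            self.reads[processor] += 1
        elif kind == "update":
            self.updates[processor] += 1
        else:
            raise ValueError(f"unknown transaction kind: {kind}")
        self.completed += 1
        return self.completed % self.k == 0

    def reset(self):
        """Assumption: counters are cleared once a test has consumed them."""
        self.reads.clear()
        self.updates.clear()


stats = WindowStats(k=5)
events = [("p1", "read"), ("p1", "read"), ("p2", "update"), ("p3", "read"), ("p1", "update")]
due = [stats.record(processor, kind) for processor, kind in events]
print(due)  # [False, False, False, False, True]
```

A larger 𝑘 gives each test more samples per processor at the price of slower adaptation.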

Next, we explain the method for computing 𝛿(𝑅), the optimal expansion-contraction pattern set that minimizes the differential workload cost. 𝛿(𝑅) is calculated as follows:

\[
\begin{aligned}
\delta(R) &= \operatorname*{argmin}_{E_i \subseteq E,\; C_j \subseteq C} \Bigl( C_{\mathrm{workload}}(R_{E_i,C_j}) - C_{\mathrm{workload}}(R) \Bigr) \\
&= \operatorname*{argmin}_{E_i \subseteq E,\; C_j \subseteq C} \Bigl( \sum_{v \in V} \#U(v, R_{E_i,C_j}) \times L_u(v, R_{E_i,C_j}) + \sum_{v \in V} \#F(v, R) \times \Delta_{E_i,C_j} L_r(v, R) - \sum_{v \in V} \#U(v, R) \times L_u(v, R) \Bigr)
\end{aligned}
\tag{1}
\]

where 𝐸 denotes the set of expandable processors, and 𝐶 represents the set of contractible processors. 𝑅_{𝐸_𝑖,𝐶_𝑗} denotes the replication scheme after expanding the processors in 𝐸_𝑖 and contracting the processors in 𝐶_𝑗. Δ_{𝐸_𝑖,𝐶_𝑗} 𝑓(𝑅) denotes 𝑓(𝑅_{𝐸_𝑖,𝐶_𝑗}) − 𝑓(𝑅). Here, assuming that the change Δ_{𝐸_𝑖,𝐶_𝑗} 𝑅 = 𝑅_{𝐸_𝑖,𝐶_𝑗} − 𝑅 is sufficiently small, we approximate #U(𝑣, 𝑅) = #U(𝑣, 𝑅_{𝐸_𝑖,𝐶_𝑗}) and #F(𝑣, 𝑅) = #F(𝑣, 𝑅_{𝐸_𝑖,𝐶_𝑗}).
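For small networks, Equation 1 can be evaluated by brute force, which is a useful reference point before any optimization. The sketch below assumes unit hop distances, applies the approximation that #U and #F are unchanged by the pattern, and skips any candidate that would leave 𝑅 empty or disconnected. The network and statistics are hypothetical, and all function names are introduced for this example only.

```python
from collections import deque
from itertools import chain, combinations


def hop_distances(network, source):
    """Breadth-first search: hop count from source to every reachable processor."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in network[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def is_connected(network, scheme):
    """True if scheme is non-empty and forms a single connected set."""
    if not scheme:
        return False
    start = next(iter(scheme))
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in network[node]:
            if neighbor in scheme and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == set(scheme)


def subsets(items):
    """All subsets of items, as sorted tuples."""
    items = sorted(items)
    return chain.from_iterable(combinations(items, n) for n in range(len(items) + 1))


def differential_cost(network, old, new, updates, fetches):
    """Differential workload cost of moving from scheme old to scheme new."""
    delta = 0.0
    for v in network:
        dist = hop_distances(network, v)
        d_lu = max(dist[p] for p in new) - max(dist[p] for p in old)
        d_lr = min(dist[p] for p in new) - min(dist[p] for p in old)
        delta += updates[v] * d_lu + fetches[v] * d_lr
    return delta


def best_pattern(network, scheme, updates, fetches):
    """Exhaustively search (E_i, C_j); returns (delta, E_i, C_j)."""
    expandable = {n for p in scheme for n in network[p]} - scheme
    contractible = {p for p in scheme if sum(n in scheme for n in network[p]) <= 1}
    best = (0.0, (), ())  # leaving R unchanged has zero differential cost
    for e_i in subsets(expandable):
        for c_j in subsets(contractible):
            candidate = (scheme | set(e_i)) - set(c_j)
            if not is_connected(network, candidate):
                continue
            delta = differential_cost(network, scheme, candidate, updates, fetches)
            if delta < best[0]:
                best = (delta, e_i, c_j)
    return best


# Hypothetical line network p1 - ... - p5 with a read-heavy workload.
NETWORK = {f"p{i}": [f"p{j}" for j in (i - 1, i + 1) if 1 <= j <= 5] for i in range(1, 6)}
UPDATES = {p: 1 for p in NETWORK}
FETCHES = {p: 10 for p in NETWORK}
print(best_pattern(NETWORK, {"p3"}, UPDATES, FETCHES))  # (-35.0, ('p2', 'p4'), ())
```

The double loop over subsets is exactly the exponential enumeration that motivates reducing the search space.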

Reduction of the Search Space

Equation 1 requires evaluating all possible patterns: each expandable processor can either be expanded or not, giving 2^|𝐸| possibilities, and each contractible processor can either be contracted or not, giving 2^|𝐶| possibilities. A straightforward computation therefore has an exponential search space of 𝑂(2^(|𝐸|+|𝐶|)), which does not scale, so the search space must be reduced.

Figure 2 illustrates the concept of reducing the search space (processors and nodes are treated as equivalent in this figure). For simplicity, assume that updates originate only from the center processor of the 𝑅-tree (which we call the 𝑅-center processor) and that all processor-to-processor distances are 1. After expansion-contraction, the candidate replication schemes can be grouped by the largest distance from the 𝑅-center processor to any 𝑅 processor, referred to as the maximum 𝑅-center distance in this paper. Schemes with identical maximum 𝑅-center distances have the same update cost, so within a group the total cost depends solely on read operations. Since read costs decrease as 𝑅 expands, only the scheme with the largest 𝑅 set within each group needs to be considered. Thus, the search space is determined by the number of groups, that is, by the number of possible values of the maximum 𝑅-center distance after expansion-contraction, resulting in a complexity of 𝑂(|𝐸| + |𝐶|).
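The grouping idea can be illustrated on a toy instance. The sketch below still enumerates every pattern, purely to show how the patterns collapse into a handful of groups; it is not the efficient procedure itself. It assumes the simplified setting above (updates only from the 𝑅-center processor, unit distances), and the line network, the sets `EXPANDABLE` and `CONTRACTIBLE`, and the function names are hypothetical.

```python
from collections import deque
from itertools import chain, combinations

# Hypothetical line network p1 - p2 - ... - p7 with unit distances.
NETWORK = {f"p{i}": [f"p{j}" for j in (i - 1, i + 1) if 1 <= j <= 7] for i in range(1, 8)}
R = {"p3", "p4", "p5"}
R_CENTER = "p4"              # assumed to be the only source of updates
EXPANDABLE = ["p2", "p6"]    # E: neighbors of R outside R
CONTRACTIBLE = ["p3", "p5"]  # C: R-leaf processors


def hop_distances(network, source):
    """Breadth-first search: hop count from source to every reachable processor."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in network[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def is_connected(network, scheme):
    """True if scheme is non-empty and forms a single connected set."""
    if not scheme:
        return False
    start = next(iter(scheme))
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in network[node]:
            if neighbor in scheme and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == set(scheme)


def subsets(items):
    """All subsets of items, as tuples."""
    return chain.from_iterable(combinations(items, n) for n in range(len(items) + 1))


def group_by_center_distance(network, scheme, center, expandable, contractible):
    """Keep, per maximum R-center distance, only the largest valid candidate."""
    dist = hop_distances(network, center)
    groups, patterns = {}, 0
    for e_i in subsets(expandable):
        for c_j in subsets(contractible):
            patterns += 1
            candidate = (scheme | set(e_i)) - set(c_j)
            if center not in candidate or not is_connected(network, candidate):
                continue
            key = max(dist[p] for p in candidate)
            if key not in groups or len(candidate) > len(groups[key]):
                groups[key] = candidate
    return patterns, groups


patterns, groups = group_by_center_distance(NETWORK, R, R_CENTER, EXPANDABLE, CONTRACTIBLE)
print(patterns, len(groups))  # 16 3
for distance in sorted(groups):
    print(distance, sorted(groups[distance]))
```

In this instance the 16 patterns reduce to three representatives, one per maximum 𝑅-center distance (0, 1, and 2), which is within the |𝐸| + |𝐶| = 4 bound.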

Only 𝑅-leaf processors can become maximum 𝑅-center processors (the processors that attain the maximum 𝑅-center distance) after expansion-contraction. The possible 𝑅-leaf processors after expansion-contraction consist of the original 𝑅-leaf processors (|𝐶|), newly expanded 𝑅-leaf processors (|𝐸|), and processors that became 𝑅-leaf processors due to contraction (|𝐶|). Since all pre-expansion 𝑅-leaf processors must be contractible, there are exactly |𝐶| such processors. Consequently, the worst-case search space is 𝑂(|𝐶| + |𝐸| + |𝐶|) = 𝑂(|𝐸| + |𝐶|). In Figure 2, there are only two groups based on the maximum 𝑅-center distance, 1 and 2, meaning that only these two groups need to be considered for optimal expansion-contraction patterns.

However, in real scenarios, updates originate from multiple processors, not just the center processor. In such cases, even if the maximum 𝑅-center distance remains unchanged, the 𝑅-eccentric distance (the maximum shortest distance from an 𝑅 processor to any other 𝑅 processor) may vary, requiring separate calculations. By treating these separately, the search space is proven (proof omitted) to be 𝑂((|𝐸| + |𝐶|)² |𝑁_𝑅(𝜎_𝑅)|), where 𝑁_𝑅(𝜎_𝑅) denotes the neighboring 𝑅 processors of the 𝑅-center processor. Additionally, using tree dynamic programming (DP) and the sliding window technique, the expansion-contraction test can be computed in 𝑂(|𝑉| + |𝑁_𝑅(𝜎_𝑅)| (|𝐸| + |𝐶|) log(|𝐸| + |𝐶|)) time.

Experiments

This section presents experimental evaluations comparing the proposed method² with the ADR algorithm across three characteristic topologies.

Experimental Setup

We conducted the experiments on an EC2 m5.16xlarge instance using Dejima [6,7,8,9,10]. Dejima is a decentralized data management system designed for flexible data integration at the database level with global consistency. Each processor was represented by deploying multiple Docker containers on a single machine. For concurrency control, the Two-Phase Locking (2PL) protocol [11,12,13] was adopted. The evaluation criterion is throughput. Throughput was calculated by dividing the total number of successful transactions (reads and updates) executed across all processors by the execution time of 300 seconds. Additionally, the throughput was measured after the replication scheme 𝑅 had converged and stabilized. The replication scheme 𝑅 was created at the record level to minimize expansion cost. The expansion-contraction test was triggered every 𝑘 = 5 transactions. The topologies used in the experiments are shown in Figure 3. The numbers in parentheses indicate the number of processors (nodes) in each topology.

We consider two types of transactions: (1) Update that modifies a column in a record, and (2) Read that reads all columns of a record. The table structure, update method, and read method in the RDBMS adhered to the YCSB [14].
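The throughput metric and the read/update mix described above can be sketched as follows. The 300-second window comes from the setup description; the transaction counts in the usage line and the function names are hypothetical, and the random choice of operation type is only an assumed way to realize a given read ratio.

```python
import random

MEASUREMENT_SECONDS = 300  # execution time stated in the experimental setup


def throughput(successful_reads, successful_updates, seconds=MEASUREMENT_SECONDS):
    """Successful transactions (reads and updates) per second."""
    return (successful_reads + successful_updates) / seconds


def choose_operation(read_ratio, rng):
    """Assumed workload mix: 'read' with probability read_ratio, else 'update'."""
    return "read" if rng.random() < read_ratio else "update"


rng = random.Random(0)  # fixed seed so repeated runs issue the same sequence
operations = [choose_operation(0.9, rng) for _ in range(1000)]
print(operations.count("read") + operations.count("update"))  # 1000
print(throughput(9000, 3000))  # 40.0 (made-up counts)
```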

In this experiment, 100 records were inserted into each processor as initial records, and these records were propagated across the entire system. For example, in General (10), 100 records are initially inserted into each processor, resulting in a total of 1,000 records.

Experimental Results

Star Topology

The experimental results for the star topology are shown in Table 1 and Table 2. Table 1 compares the case where the replication scheme 𝑅 is minimized, meaning only the center processor is part of 𝑅, and the case where 𝑅 is maximized, meaning updates are propagated to all processors. Table 2 shows the results of the existing method (the ADR algorithm) and the proposed method, along with their ratio (relative throughput).

For Star(4), as shown in Table 1, min |𝑅| is ahead of max |𝑅| at 10% and 50% read ratios (41.5 vs. 40.1 and 79.5 vs. 71.7), whereas at a 90% read ratio max |𝑅| achieves substantially higher throughput (327.4 vs. 242.8). As the proportion of read transactions increases, their impact becomes greater than that of update transactions, making it more effective to expand 𝑅 to reduce workload cost.

A comparison of the existing method and the proposed method in Star(4) is shown in Table 2. In the existing method, performance remains stable when the read ratio is low but degrades significantly as the read ratio increases. This is because the existing method tends to overestimate update costs in parallel processing environments, leading to an unnecessarily small |𝑅| and performance degradation in read-heavy environments. This overestimation occurs because the existing method only considers communication cost, ignoring cases where updates can be executed concurrently without increasing execution time.

In contrast, the proposed method mitigates performance degradation due to its consideration of parallel execution costs. However, a slight performance decline was still observed, possibly due to the small 𝑘 value, which affects the accuracy of statistical data in the expansion-contraction test. Increasing 𝑘 could improve the accuracy and lead to better performance.

Linear and General Topologies

The experimental results for Linear (9) and General (10) topologies are shown in Table 3 and Table 4.

In Linear (9), as shown in Table 3, min |𝑅| achieves higher throughput than max |𝑅| at every read ratio, and its relative advantage grows as the read ratio decreases. This is because a lower read ratio results in a higher proportion of update transactions, making a smaller |𝑅| more advantageous. Table 4 shows that both methods achieve performance close to the optimal min |𝑅| case, demonstrating successful optimization. The two methods do not differ significantly because, unlike the Star topology, the Linear topology has a lower degree of parallelism, which reduces the performance gap between the methods.

In General (10), as shown in Table 3, min |𝑅| achieves higher throughput than max |𝑅| at 10% and 50% read ratios, as in Linear (9), whereas at a 90% read ratio max |𝑅| is slightly ahead (284.5 vs. 279.8). At a 10% read ratio, Table 4 shows that both methods achieve near-optimal values with no notable difference. At 50% and 90% read ratios, the proposed method achieves higher throughput than the existing method. This result, as in the Star topology, is attributed to the consideration of parallel computation and per-processor costs. At a 90% read ratio, the existing method underperforms both min |𝑅| and max |𝑅|, while the proposed method surpasses both. This indicates that the ADR algorithm sometimes converges to a worse solution than either min |𝑅| or max |𝑅|, whereas the proposed method has the potential to reach an optimal solution that is neither the smallest nor the largest |𝑅|.
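The relative throughput reported in the Ratio rows of Tables 2 and 4 can be recomputed from the listed throughput values. The short check below uses only numbers that appear in those tables; the function name and the rounding to whole percent are assumptions of this sketch.

```python
def relative_gain_percent(existing, proposed):
    """Throughput change of the proposed method over the existing one, in whole percent."""
    return round((proposed / existing - 1) * 100)


# (existing, proposed) throughput pairs at 10%, 50%, and 90% read ratios.
TABLE_2_STAR = [(41.8, 41.4), (77.2, 76.5), (297.4, 324.6)]
TABLE_4_LINE = [(53.7, 54.4), (94.5, 95.2), (291.0, 293.6)]
TABLE_4_GENERAL = [(37.5, 37.8), (68.0, 74.9), (255.1, 286.1)]

for name, rows in [("Star(4)", TABLE_2_STAR), ("Line(9)", TABLE_4_LINE), ("General(10)", TABLE_4_GENERAL)]:
    print(name, [relative_gain_percent(e, p) for e, p in rows])
# Star(4) [-1, -1, 9]
# Line(9) [1, 1, 1]
# General(10) [1, 10, 12]
```

The largest gain, 12% for General(10) at a 90% read ratio, corresponds to the headline figure quoted in the abstract.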

DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain * Corresponding author.

Figure 1: Replication scheme example, which consists of processors 𝑝4 and 𝑝5, depicted in green.

Figure 2: Illustration of search space reduction in expansion-contraction tests.

Figure 3: Topologies used in the experiments: Star(4), Line(9), and General(10).

Table 1: Comparison of throughput between min |𝑅| and max |𝑅| in Star topology.

Star(4)
Read ratio (%)   10     50     90
min |𝑅|          41.5   79.5   242.8
max |𝑅|          40.1   71.7   327.4

Note: When generating the initial records for each table, record insertions into the base table of each processor were propagated to multiple processors via Dejima's data-sharing mechanism.

Table 2: Comparison of throughput between the existing method and the proposed method in Star topology.

Star(4)
Read ratio (%)   10     50     90
Existing         41.8   77.2   297.4
Proposed         41.4   76.5   324.6
Ratio            -1%    -1%    +9%

Table 3: Comparison of throughput between min |𝑅| and max |𝑅| in Linear and General topologies.

                 Line(9)                 General(10)
Read ratio (%)   10     50     90       10     50     90
min |𝑅|          57.9   99.4   298.9    38.0   76.8   279.8
max |𝑅|          34.2   62.5   286.5    32.8   59.5   284.5

Table 4: Comparison of throughput between the existing method and the proposed method in Linear and General topologies.

                 Line(9)                 General(10)
Read ratio (%)   10     50     90       10     50     90
Existing         53.7   94.5   291.0    37.5   68.0   255.1
Proposed         54.4   95.2   293.6    37.8   74.9   286.1
Ratio            +1%    +1%    +1%      +1%    +10%   +12%

¹ There is also an operation called "Switch". However, since the main algorithm consists primarily of expansion and contraction, it is omitted here for simplicity.

² Source code is available at: https://github.com/OnizukaLab/dejimadynamic-replication

Acknowledgements

This work is supported by JSPS Kakenhi JP23K17456, JP23K25157, JP23K28096, and JST CREST JPMJCR22M2.

References

[1] O. Wolfson, S. Jajodia, Y. Huang, An adaptive data replication algorithm, ACM Trans. Database Syst. 22 (1997).
[2] D.-W. Sun, G.-R. Chang, S. Gao, L.-Z. Jin, X.-W. Wang, Modeling a dynamic data replication strategy to increase system availability in cloud computing environments, Journal of Computer Science and Technology (2012).
[3] Q. Wei, B. Veeravalli, B. Gong, L. Zeng, D. Feng, CDRM: A cost-effective dynamic replication management scheme for cloud storage cluster, in: IEEE International Conference on Cluster Computing, 2010.
[4] W. Li, Y. Yang, D. Yuan, A novel cost-effective dynamic data replication strategy for reliability in cloud data centres, in: IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing, 2011.
[5] J.-W. Lin, C.-H. Chen, J. M. Chang, QoS-aware data replication for data-intensive applications in cloud computing systems, IEEE Transactions on Cloud Computing 1 (2013).
[6] O. Lab, Dejima architecture, 2023.
[7] Y. Asano, S. Hidaka, Z. Hu, Y. Ishihara, H. Kato, H. Ko, K. Nakano, M. Onizuka, Y. Sasaki, T. Shimizu, V. Tran, K. Tsushima, M. Yoshikawa, Making view update strategies programmable - toward controlling and sharing distributed data, arXiv:1809.10357, 2018.
[8] Y. Asano, Z. Hu, Y. Ishihara, H. Kato, M. Onizuka, M. Yoshikawa, Controlling and sharing distributed data for implementing service alliance, in: BigComp, IEEE, 2019.
[9] Y. Asano, D. Herr, Y. Ishihara, H. Kato, K. Nakano, M. Onizuka, Y. Sasaki, Flexible framework for data integration and update propagation: System aspect, in: BigComp, IEEE, 2019.
[10] Z. Hu, M. Onizuka, M. Yoshikawa, Bidirectional collaborative data management, in: Bidirectional Collaborative Data Management: Collaboration Frameworks for Decentralized Systems, 2024.
[11] P. A. Bernstein, V. Hadzilacos, N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.
[12] C. H. Papadimitriou, The Theory of Database Concurrency Control, Computer Science Press, 1986.
[13] K. P. Eswaran, J. Gray, R. A. Lorie, I. L. Traiger, The notions of consistency and predicate locks in a database system, Commun. ACM 19 (1976).
[14] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, R. Sears, Benchmarking cloud serving systems with YCSB, in: Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, Association for Computing Machinery, New York, NY, USA, 2010.