=Paper=
{{Paper
|id=Vol-3931/short6
|storemode=property
|title=Workload Cost Optimization Using Dynamic Replication in Decentralized Systems
|pdfUrl=https://ceur-ws.org/Vol-3931/short6.pdf
|volume=Vol-3931
|authors=Ryoga Yoshida,Chuan Xiao,Makoto Onizuka
|dblpUrl=https://dblp.org/rec/conf/dolap/Yoshida0O25
}}
==Workload Cost Optimization Using Dynamic Replication in Decentralized Systems==
Ryoga Yoshida¹, Chuan Xiao¹ and Makoto Onizuka¹

¹ Osaka University, Yamadaoka, Suita, Osaka, 565-0871, Japan
Abstract
Data replication plays a crucial role in decentralized systems by enhancing durability and availability. The ADR algorithm is a dynamic
replication method that optimizes communication costs by adaptively adjusting the number of replicas. However, it overlooks workload
costs, which are critical in real-world applications, leading to suboptimal performance, especially in parallel processing environments.
To address this limitation, we propose an enhanced ADR algorithm that incorporates both communication and computational costs.
Our method refines the cost model by considering the maximum-cost execution path in update transactions, ensuring a more accurate
workload estimation. Additionally, we introduce an improved expansion-contraction test that efficiently optimizes replication placement.
Experimental evaluations across various network topologies demonstrate that the proposed method achieves up to 12% higher throughput
than the existing ADR algorithm, particularly in read-heavy environments. These results indicate that our approach provides a more
balanced and efficient replication strategy, adapting to diverse workload patterns in decentralized systems.
Keywords
dynamic replication, decentralized systems, transaction management
1. Introduction

Data replication is a fundamental technique in decentralized systems, where data is replicated and stored across multiple processors. When writing data, transactions synchronize updates on the replicas across multiple processors to ensure durability. When reading data, the data can be retrieved from any processor holding the latest version, thereby maintaining consistency.

The number of processors where the latest data is replicated (we call these replication processors) is a critical factor in data replication and can significantly impact system performance. For example, in systems with read-heavy workloads, a small number of replication processors leads to frequent data retrieval from remote replication processors, degrading performance. In contrast, in systems with update-heavy workloads, a large number of replication processors increases the update load and degrades performance. Thus, for read-heavy workloads, the number of replication processors should be large to reduce the number of data retrievals from remote replication processors, while for update-heavy workloads, it should be small to decrease the update load.

The optimal number of replication processors typically depends on the frequency of read and update transactions on each processor. In many decentralized systems, system designers must define the number of replication processors statically during the design phase and then manually adjust it during the production phase [1]. However, this approach is suboptimal in environments with frequently fluctuating read/update transactions and is also inefficient due to the manual effort required of system designers. To overcome this, dynamic replication techniques are promising in the sense that they adaptively adjust the number of replication processors.

The ADR algorithm [1] is one such dynamic replication technique. It adaptively changes the number of replication processors according to the read/update transactions by periodically performing expansion and contraction tests. However, it has two issues: (1) it focuses solely on optimizing communication cost and does not consider workload cost, which is more critical in real-world applications, and (2) it prioritizes minimizing communication cost, which can lead to longer overall transaction execution times.

To overcome these issues, we propose an enhanced ADR algorithm. Specifically, in addition to communication costs, we consider processor computational costs, allowing for a more accurate estimation of the overall workload cost. Furthermore, since the execution time of update transactions in parallel environments is determined by the heaviest computational path, we redefine the update transaction cost by focusing only on the maximum-cost path, rather than summing the weights of all paths.

2. Preliminaries

In decentralized systems, minimizing the workload cost across the entire system is a critical factor. We focus on applications that perform system-wide operations with the objective of reducing overall workload cost. Various dynamic replication algorithms have been proposed [1, 2, 3, 4, 5], among which the ADR algorithm [1] is designed to optimize overall communication cost by adaptively modifying the replication scheme R. Given its objective, it is the most relevant algorithm for achieving the goal of this study.
DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain.
Corresponding author: Ryoga Yoshida (yoshida.ryoga@ist.osaka-u.ac.jp); Chuan Xiao (chuanx@ist.osaka-u.ac.jp); Makoto Onizuka (onizuka@ist.osaka-u.ac.jp).

2.1. Replication Scheme

A replication scheme R represents the set of processors that hold the latest replicas. It forms a variable-sized "amoeba" that shifts toward the center of the read/write (read/update) requests, and it is created for each data object. When the number of read requests increases, the ADR algorithm expands R to reduce the communication cost by responding to read requests from a local processor or nearby processors. In contrast, when the number of write requests increases, the ADR algorithm shrinks R to reduce the overhead of updating the replicas in R. Hereafter, the processors included in R are referred to as R processors.

For data reading, if the processor where a read request occurs belongs to R, it reads the local replica. Otherwise, the processor sequentially sends read requests to its neighboring processors; once a request reaches an R processor, that processor returns the replica to the requester. For data writing, the replicas on all R processors are updated synchronously by repeatedly forwarding the data to neighboring processors.

Figure 1: Replication scheme example, which consists of processors p4 and p5, depicted in green.

As an example, consider the communication network shown in Figure 1. The replication scheme R consists of processors p4 and p5, depicted in green. When reading data at processor p1, the nearest R processor is p4, from which the data is fetched. When writing data at processor p1, the update is sequentially sent to p4 and p5 via p2, and the replicas on those processors are updated.

Under the ADR algorithm, the replication scheme R always forms a single connected set of processors. In addition, R is created at various object granularities, such as a tuple, block, or text file. It is guaranteed that when the read-write pattern at each processor (the number of reads and writes issued by each processor) is regular, the replication scheme converges to the optimal configuration regardless of the initial scheme [1].

2.2. Expansion and Contraction

R is periodically adjusted through expansion and contraction¹ every fixed period. Expansion occurs in systems with read-intensive workloads, increasing the size of R (i.e., adding processors to R) to reduce the communication cost between requesting processors and the R processors. In contrast, contraction occurs in systems with write-intensive workloads, decreasing the size of R (i.e., removing processors from R) to reduce the communication cost among the R processors. Whether to expand or contract R is determined by executing the expansion and contraction tests, respectively.

¹ There is also an operation called "Switch". However, since the main algorithm consists primarily of expansion and contraction, it is omitted here for simplicity.

2.3. Issues of the ADR Algorithm

The ADR algorithm focuses solely on optimizing communication cost and does not consider workload cost, which is more critical in real-world applications. As a result, it fails to account for disparities in processing costs among processors and for differences in execution time between read and write transactions.

Additionally, in parallel processing environments, there are cases where communication time increases but transaction execution time decreases. In such situations, the ADR algorithm prioritizes minimizing communication cost, which can inadvertently prolong the overall transaction execution time.

3. Proposed Method

This section introduces an improved version of the ADR algorithm. As noted earlier, the ADR algorithm optimizes only communication cost, neglecting workload cost, which is more crucial in real-world applications. To overcome this issue, the proposed method modifies the cost function of the ADR algorithm and redefines the optimization equation for a more realistic workload representation.

Specifically, in addition to communication costs, the proposed method considers processor computational costs, allowing for a more accurate estimation of the overall workload cost. Furthermore, since update transaction processing time in parallel execution environments is determined by the heaviest computational path, the proposed method redefines the update transaction cost by focusing only on the maximum-cost path, rather than summing the weights of all paths.

3.1. Optimization Formula

We revise the ADR algorithm to optimize the workload cost across the entire system. To optimize the workload cost rather than the communication cost, we extend the objective formula using not only the number of communications but also their associated read/update costs.

The workload cost and the optimization problem are defined as follows:

    C_workload(R) := Σ_{v∈V} ( #U(v, R) × C_u(v, R) + #F(v, R) × C_r(v, R) )

    argmin_R C_workload(R)

where v denotes a processor in the network, V denotes the set of all processors, R denotes the replication scheme, #U(v, R) denotes the number of update transactions, C_u(v, R) denotes the cost of an update transaction, #F(v, R) denotes the number of fetch transactions, and C_r(v, R) denotes the cost of a read transaction. The goal of the optimization is to find the replication scheme R that minimizes the workload cost. In practice, R is gradually adjusted to progressively reduce the workload cost as much as possible.

In the proposed method, C_u(v, R), #F(v, R), and C_r(v, R) are defined as follows:

    C_u(v, R) := L_u(v, R)
    #F(v, R) := #R(v, R) × β(v, R)
    C_r(v, R) := L_r(v, R)

where L_u(v, R) is the distance from processor v to the farthest R processor, β(v, R) is the cache miss rate, and L_r(v, R) is the distance from processor v to the nearest R processor. When a cache hit occurs, no fetch operation is triggered, allowing local reads at zero cost. Therefore, the workload cost at a processor considers only the read costs incurred by fetch operations, obtained by multiplying the number of read transactions #R(v, R) by the cache miss rate β(v, R), yielding #F(v, R).
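The cost model above can be sketched in code. The following is a minimal illustration (our own sketch, not the authors' implementation), assuming an unweighted network where distances L_u and L_r are hop counts computed by BFS; the function names, the small path network, and all counts are our assumptions.

```python
# Minimal sketch (ours, not the authors' implementation) of the workload
# cost C_workload(R) from Section 3.1 on an unweighted processor network.
# L_u is the hop distance to the farthest R processor (the max-cost path
# an update must reach); L_r the distance to the nearest R processor;
# #F(v,R) = #R(v,R) * beta(v,R).
from collections import deque

def bfs_dist(adj, src):
    """Hop-count shortest distances from src to every reachable processor."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def workload_cost(adj, scheme, n_updates, n_reads, miss_rate):
    """C_workload(R) = sum_v #U(v)*L_u(v,R) + #R(v)*beta(v)*L_r(v,R)."""
    total = 0.0
    for v in adj:
        d = bfs_dist(adj, v)
        to_scheme = [d[r] for r in scheme]
        l_u = max(to_scheme)  # an update must reach the farthest replica
        l_r = min(to_scheme)  # a fetch only travels to the nearest replica
        total += n_updates[v] * l_u + n_reads[v] * miss_rate[v] * l_r
    return total

# Path network p1 - p2 - p4 - p5 with R = {p4, p5} (cf. Figure 1).
adj = {"p1": ["p2"], "p2": ["p1", "p4"], "p4": ["p2", "p5"], "p5": ["p4"]}
cost = workload_cost(
    adj, {"p4", "p5"},
    n_updates={"p1": 1, "p2": 0, "p4": 0, "p5": 0},
    n_reads={"p1": 8, "p2": 0, "p4": 0, "p5": 0},
    miss_rate={"p1": 0.5, "p2": 0.0, "p4": 0.0, "p5": 0.0},
)
```

Note the asymmetry that distinguishes this cost from the original ADR objective: updates pay the maximum distance over R (the heaviest path), while cache-missing reads pay only the minimum.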
3.2. Expansion-Contraction Test

The workload cost is optimized through an expansion-contraction test, which is executed after every k successfully completed transactions. In the expansion-contraction test, for each expansion-contraction pattern, the system determines whether to expand expandable processors or shrink shrinkable processors by calculating the differential workload cost.

As in the ADR algorithm, each expansion-contraction test forbids operations such as expanding or contracting by more than one hop and forming a disconnected R. These restrictions are imposed due to computational complexity concerns and the potential for excessive fluctuation in #F(v) and #U(v) before and after expansion-contraction.

Next, we explain how to compute δ(R), the optimal expansion-contraction pattern that minimizes the differential workload cost. δ(R) is calculated as follows:

    δ(R) = argmin_{E_s⊆E, C_s⊆C} C_workload(R_{E_s,C_s}) − C_workload(R)
         = argmin_{E_s⊆E, C_s⊆C} ( Σ_{v∈V} #U(v, R_{E_s,C_s}) × L_u(v, R_{E_s,C_s})
                                  + Σ_{v∈V} #F(v, R) × Δ_{E_s,C_s} L_r(v, R) )
           − Σ_{v∈V} #U(v, R) × L_u(v, R)                                   (1)

where E denotes the set of expandable processors and C the set of contractible processors. R_{E_s,C_s} denotes the replication scheme after expanding the processors in E_s and contracting the processors in C_s. Δ_{E_s,C_s} f(R) denotes f(R_{E_s,C_s}) − f(R). Here, assuming that Δ_{E_s,C_s} R = R_{E_s,C_s} − R is sufficiently small, we approximate #U(v, R) = #U(v, R_{E_s,C_s}) and #F(v, R) = #F(v, R_{E_s,C_s}).

3.3. Reduction of the Search Space

Equation 1 requires evaluating all possible patterns: each expandable processor can either be expanded or not, giving 2^|E| possibilities, and each contractible processor can either be contracted or not, giving 2^|C| possibilities. A straightforward computation therefore results in an exponential search space of O(2^{|E|+|C|}), which is impractical for scalability. Thus, reducing the search space is necessary.

Figure 2: Illustration of search space reduction in expansion-contraction tests.

Figure 2 illustrates the concept of reducing the search space (processors and nodes are treated as equivalent in the figure). For simplicity, assume that updates originate only from the center processor of the R-tree (we call it the R-center processor) and that all processor-to-processor distances are 1. After expansion-contraction, processors can be grouped by their distance from the R-center processor, referred to as the maximum R-center distance in this paper. Schemes with identical maximum R-center distances exhibit the same update costs, making the total cost depend solely on read operations. Since read costs decrease as R expands, only the case with the largest R within each group needs to be considered. Thus, the number of such groups determines the search space, which corresponds to the possible values of the maximum R-center distance after expansion-contraction, resulting in a complexity of O(|E| + |C|).

Only R-leaf processors can realize the maximum R-center distance after expansion-contraction. The possible kinds of R-leaf processors after expansion-contraction consist of the original R-leaf processors (|C|), newly expanded R-leaf processors (|E|), and processors that became R-leaf processors due to contraction (|C|). Since all pre-expansion R-leaf processors must be contractible, there are exactly |C| such processors. Consequently, the worst-case search space is O(|C| + |E| + |C|) = O(|E| + |C|). In Figure 2, there are only two groups based on the maximum R-center distance, 1 and 2, so only these two groups need to be considered when searching for optimal expansion-contraction patterns.

However, in real scenarios, updates originate from multiple processors, not just the center processor. In such cases, even if the maximum R-center distance remains unchanged, the R-eccentric distance (the maximum shortest distance from an R processor to any other R processor) may vary, requiring separate calculations. By treating these cases separately, the search space is proven (proof omitted) to be O((|E| + |C|)² |N_R(c_R)|), where N_R(c_R) denotes the neighboring R processors of the R-center processor. Additionally, using dynamic programming on trees and the sliding-window technique, the expansion-contraction test can be computed in O(|V| + |N_R(c_R)| (|E| + |C|) log(|E| + |C|)) time.

4. Experiments

This section presents experimental evaluations comparing the proposed method² with the ADR algorithm across three characteristic topologies.

² Source code is available at: https://github.com/OnizukaLab/dejima-dynamic-replication

4.1. Experimental Setup

We conducted the experiments on an EC2 m5.16xlarge instance using Dejima [6, 7, 8, 9, 10]. Dejima is a decentralized data management system designed for flexible data integration at the database level with global consistency. Each processor was represented by deploying multiple Docker containers on a single machine. For concurrency control, the two-phase locking (2PL) protocol [11, 12, 13] was adopted.

The evaluation criterion is throughput, calculated by dividing the total number of successful transactions (reads and updates) executed across all processors by the execution time of 300 seconds. Throughput was measured after the replication scheme R had converged and stabilized. The replication scheme R was created at the record level to minimize expansion cost. The expansion-contraction test was triggered every k = 5 transactions. The topologies used in the experiments are shown in Figure 3; the numbers in parentheses indicate the number of processors (nodes) in each topology.

We consider two types of transactions: (1) Update, which modifies a column in a record, and (2) Read, which reads all columns of a record. The table structure, update method, and read method in the RDBMS adhered to YCSB [14].
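The read-ratio workloads evaluated below can be sketched as follows. This is our own illustrative generator in the spirit of YCSB, not the authors' harness; the function name, the fixed seed, and the uniform key distribution are assumptions.

```python
# Illustrative workload generator (ours, in the spirit of YCSB; not the
# authors' harness). Produces a read/update transaction mix for a given
# read ratio, e.g. the 10%, 50%, and 90% ratios used in the experiments.
import random

def make_workload(n_txns, read_ratio, n_records, seed=42):
    """Return a list of ('read', key) / ('update', key) operations."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    ops = []
    for _ in range(n_txns):
        key = rng.randrange(n_records)  # uniform key choice (an assumption)
        kind = "read" if rng.random() < read_ratio else "update"
        ops.append((kind, key))
    return ops

# A read-heavy mix: roughly 90% reads over 100 records, as in the 90% setting.
workload = make_workload(1000, 0.9, 100)
```

Each ('update', key) operation stands for modifying one column of the record, and each ('read', key) for reading all of its columns, mirroring the transaction types described above.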
When generating the initial records for each table, record insertions into the base table of each processor were propagated to multiple processors via Dejima's data-sharing mechanism. In this experiment, 100 records were inserted into each processor as initial records, and these records were propagated across the entire system. For example, in General(10), 100 records are initially inserted into each processor, resulting in a total of 1,000 records.

Figure 3: Topologies used in the experiments: Star(4), Line(9), and General(10).

Table 1
Comparison of throughput between min |R| and max |R| in Star topology.

              Star(4)
  Read ratio  10     50     90
  min |R|     41.5   79.5   242.8
  max |R|     40.1   71.7   327.4

Table 2
Comparison of throughput between the existing method and the proposed method in Star topology.

              Star(4)
  Read ratio  10     50     90
  Existing    41.8   77.2   297.4
  Proposed    41.4   76.5   324.6
  Ratio       -1%    -1%    +9%

Table 3
Comparison of throughput between min |R| and max |R| in Linear and General topologies.

              Line(9)               General(10)
  Read ratio  10     50     90      10     50     90
  min |R|     57.9   99.4   298.9   38.0   76.8   279.8
  max |R|     34.2   62.5   286.5   32.8   59.5   284.5

Table 4
Comparison of throughput between the existing method and the proposed method in Linear and General topologies.

              Line(9)               General(10)
  Read ratio  10     50     90      10     50     90
  Existing    53.7   94.5   291.0   37.5   68.0   255.1
  Proposed    54.4   95.2   293.6   37.8   74.9   286.1
  Ratio       +1%    +1%    +1%     +1%    +10%   +12%

4.2. Experimental Results

4.2.1. Star Topology

The experimental results for the Star topology are shown in Table 1 and Table 2. Table 1 compares the case where the replication scheme R is minimized (min |R|), meaning only the center processor is part of R, with the case where R is maximized (max |R|), meaning updates are propagated to all processors. Table 2 shows the results of the existing method (the ADR algorithm) and the proposed method, along with their ratio (relative throughput).

For Star(4), as shown in Table 1, as the read ratio increases, max |R| achieves higher throughput than min |R|, with the performance gap widening at higher read ratios. As the proportion of read transactions increases, their impact outweighs that of update transactions, making it more effective to expand R to reduce the workload cost.

A comparison of the existing method and the proposed method in Star(4) is shown in Table 2. With the existing method, performance remains stable when the read ratio is low but degrades significantly as the read ratio increases. This is because the existing method tends to overestimate update costs in parallel processing environments, leading to an unnecessarily small |R| and thus degraded performance in read-heavy environments. This overestimation occurs because the existing method considers only communication cost, ignoring cases where updates can be executed concurrently without increasing execution time.

In contrast, the proposed method mitigates this performance degradation by taking parallel execution costs into account. However, a slight performance decline was still observed, possibly due to the small k value, which affects the accuracy of the statistics gathered for the expansion-contraction test. Increasing k could improve this accuracy and lead to better performance.

4.2.2. Linear and General Topologies

The experimental results for the Line(9) and General(10) topologies are shown in Table 3 and Table 4.

In Line(9), as shown in Table 3, as the read ratio decreases, the throughput gap between min |R| and max |R| widens in favor of min |R|. This is because a lower read ratio results in a higher proportion of update transactions, making a smaller |R| more advantageous. Table 4 shows that both methods achieve performance close to the optimal min |R| case, demonstrating successful optimization. The reason there is no significant difference between the two methods is that, unlike the Star topology, the Line topology has a lower degree of parallelism, which reduces the performance gap between the methods.

In General(10), similar to Line(9), Table 3 shows that as the read ratio decreases, the throughput gap between min |R| and max |R| widens in favor of min |R|. Additionally, at a 10% read ratio, Table 4 shows that both methods achieve near-optimal values with no notable difference. At 50% and 90% read ratios, the proposed method achieves higher throughput than the existing method. As in the Star topology, this result is attributed to the consideration of parallel computation and per-processor costs. At a 90% read ratio, the existing method underperforms both min |R| and max |R|, while the proposed method surpasses both. This indicates that the ADR algorithm can converge to a solution worse than either min |R| or max |R|, whereas the proposed method can reach an optimal solution that is neither the smallest nor the largest |R|.

Acknowledgements

This work is supported by JSPS Kakenhi JP23K17456, JP23K25157, JP23K28096, and JST CREST JPMJCR22M2.
References

[1] O. Wolfson, S. Jajodia, Y. Huang, An adaptive data replication algorithm, ACM Trans. Database Syst. 22 (1997) 255–314.
[2] D.-W. Sun, G.-R. Chang, S. Gao, L.-Z. Jin, X.-W. Wang, Modeling a dynamic data replication strategy to increase system availability in cloud computing environments, Journal of Computer Science and Technology (2012) 256–272.
[3] Q. Wei, B. Veeravalli, B. Gong, L. Zeng, D. Feng, CDRM: A cost-effective dynamic replication management scheme for cloud storage cluster, in: 2010 IEEE International Conference on Cluster Computing, 2010, pp. 188–196.
[4] W. Li, Y. Yang, D. Yuan, A novel cost-effective dynamic data replication strategy for reliability in cloud data centres, in: 2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing, 2011, pp. 496–502.
[5] J.-W. Lin, C.-H. Chen, J. M. Chang, QoS-aware data replication for data-intensive applications in cloud computing systems, IEEE Transactions on Cloud Computing 1 (2013) 101–115.
[6] Onizuka Lab, Dejima architecture, https://github.com/OnizukaLab/dejima-prototype, 2023.
[7] Y. Asano, S. Hidaka, Z. Hu, Y. Ishihara, H. Kato, H. Ko, K. Nakano, M. Onizuka, Y. Sasaki, T. Shimizu, V. Tran, K. Tsushima, M. Yoshikawa, Making view update strategies programmable - toward controlling and sharing distributed data, CoRR abs/1809.10357 (2018). arXiv:1809.10357.
[8] Y. Asano, Z. Hu, Y. Ishihara, H. Kato, M. Onizuka, M. Yoshikawa, Controlling and sharing distributed data for implementing service alliance, in: BigComp, IEEE, 2019, pp. 1–4.
[9] Y. Asano, D. Herr, Y. Ishihara, H. Kato, K. Nakano, M. Onizuka, Y. Sasaki, Flexible framework for data integration and update propagation: System aspect, in: BigComp, IEEE, 2019, pp. 1–5.
[10] Z. Hu, M. Onizuka, M. Yoshikawa, Bidirectional collaborative data management, in: Bidirectional Collaborative Data Management: Collaboration Frameworks for Decentralized Systems, 2024, pp. 63–119.
[11] P. A. Bernstein, V. Hadzilacos, N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.
[12] C. H. Papadimitriou, The Theory of Database Concurrency Control, Computer Science Press, 1986.
[13] K. P. Eswaran, J. Gray, R. A. Lorie, I. L. Traiger, The notions of consistency and predicate locks in a database system, Commun. ACM 19 (1976) 624–633.
[14] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, R. Sears, Benchmarking cloud serving systems with YCSB, in: Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, Association for Computing Machinery, New York, NY, USA, 2010, pp. 143–154.