=Paper=
{{Paper
|id=None
|storemode=property
|title=Modelling Snapshot Isolation Performance
|pdfUrl=https://ceur-ws.org/Vol-735/paper5.pdf
|volume=Vol-735
|dblpUrl=https://dblp.org/rec/conf/syrcodis/Vasilik11
}}
==Modelling Snapshot Isolation Performance==
© Dmitri Vasilik
Saint Petersburg State University
Dmitri.Vasilik@gmail.com
Proceedings of the Spring Researcher's Colloquium on Database and Information Systems, Moscow, Russia, 2011

Abstract

Snapshot Isolation (SI) is an isolation level extensively used in commercial database systems. We developed a simple SI protocol for distributed DBMSs and implemented it in Apache HBase. This work presents a performance evaluation of the protocol. We have measured the performance of a single-node system and modelled the performance of a distributed HBase cluster.

1 Introduction

In the last ten years a special class of data-intensive web applications has emerged; YouTube, Google Maps, and social networks are representative examples. These applications work with large volumes of data and typically do not need ACID transactions.
The choice of the concurrency control to employ in an application is of great importance, since it affects the performance of the application as well as the complexity and cost of development. For example, choosing eventual consistency (EC) as the isolation level may complicate the programming model. The application designer therefore has to look carefully for the best compromise between the guarantees obtained and the performance of the isolation level protocols.

The Snapshot Isolation (SI) level was introduced in [2]. It is widely adopted by commercial and open-source systems thanks to its ability to cope with read-intensive workloads and its high degree of consistency guarantees. It is even used in place of the serializable isolation level in Oracle and PostgreSQL, a decision that can be justified by the level of guarantees SI actually offers.

A transaction executed under SI reads data from a snapshot of the data as of the time the transaction started. Two transactions are called competing if they have overlapping execution intervals and have written the same data item. When a transaction is ready to commit, it is assigned a commit timestamp, and it commits only if there is no competing transaction in the system. This feature is called "First-Committer-Wins".

We developed an SI protocol for distributed DBMSs and presented it in [12] along with a prototype implementation. We chose HBase as the implementation platform because it is an open-source distributed data storage that natively supported only EC¹. In [12] we measured the performance of the prototype in single-node mode and compared it with that of HBase, but we had no opportunity to run the system in distributed mode. Compared to [12], in this work we present a performance model of the protocol and compare its simulation results with HBase simulation results for the distributed mode.

The remainder of this paper is organized as follows. In Section 2 we discuss the literature on DBMS performance evaluation, with concurrency control performance comparison in focus. Section 3 describes the protocol and its implementation. The performance model is presented in Section 4. Section 5 presents the results of the simulation experiments. Section 6 discusses the performance of the protocol implementation along with model validation issues. Section 7 sums up the results.

¹ Optimistic concurrency control was the only alternative available for HBase at the time we started the implementation.

2 Related Work

Many concurrency control algorithms have been introduced over the last 30 years, and considerable research was devoted to evaluating their performance in the 1980s and early 1990s.

In [1] the authors examined the assumption sets employed in concurrency control (CC) performance modelling and investigated the reasons for the apparent contradictions in the performance results of earlier studies. In particular, the common assumption of infinite resources was critically examined. The paper [11] lists CC methods along with results obtained from analytical and simulation studies; for each CC method, the models employed for its performance evaluation are briefly described. In [10] issues in modelling performance are discussed; the paper contains dozens of questions a researcher has to answer before starting to implement a model, along with some critique of relying on intuition in performance modelling.

The works [7, 6, 5] focus on the performance evaluation of distributed systems, although CC methods are not their main concern. In [6] a classification of performance models based on their assumption sets is provided. The paper [7] contains a detailed comparison of the Shared Disk and Shared Nothing (SN) architectures; in particular, the performance of the Scan and Join operators is discussed.
The paper [5] contains a comprehensive SN DBMS performance study; performance is modelled using three different workloads, one of them generated from database traces.

The work [8] presents an analytical SI performance model for a standalone server machine.

Recently there have been a number of studies on SI implementation in distributed systems [14, 4, 13]. The work [4] presents a technique that preserves the behaviour of a centralized SI system while guaranteeing global SI in a distributed, lazily replicated system; the technique leverages local SI concurrency control. A simulation model was developed to study the cost of using the technique, but the model is rather simple with respect to single-node performance, e.g. the service time per operation is specified as a model parameter.

The article [13] presents a cloud storage system named ecStore. It contains a brief description of the concurrency control protocols used, but no CC performance comparison.

The paper [14] describes an approach to using HBase as a cloud database solution with global SI. It presents an optimistic protocol which uses four special tables and several queries to provide SI guarantees. The protocol is implemented on top of bare-bones HBase on the client side; the implementation is rather high-level and does not introduce any changes to the server code. The authors conducted several experiments in a distributed environment (a 3-machine cluster) to evaluate the cost of employing the proposed protocol.

In this work we present performance models for our protocol and for HBase, and compare their simulation results. To the best of our knowledge, a performance model of a key-value store like HBase has not been presented before.
3 Implementation Protocol

3.1 Protocol Description

Let us suppose that the current transaction state, its start timestamp, and its end timestamp (if the transaction has already been committed or aborted) are available at execution time.

Read operation execution is straightforward. To read a data object x, the transaction t selects all versions of x committed before the start time of t, then traverses the obtained list to find the latest committed version. If the list contains a version written by t itself, that version is read instead of the latest committed one.

Before the value of x can be updated, the transaction has to ensure that no competing transaction has already written x. If x was updated by a transaction that committed after t started, or by a still-active transaction, then t aborts.

We have formally proved the correctness of the protocol in [12], but the proof is out of the scope of this work.
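To make the two rules above concrete, the following is a minimal sketch of the read and version-check logic under the stated assumptions. It is illustrative only: the Version and Store types and all method names are ours, not taken from the prototype code.

    import java.util.List;

    // Hypothetical types, for illustration only: a version of a data object.
    class Version {
        long writerTxId;        // transaction that wrote this version
        long commitTimestamp;   // 0 if the writer has not committed yet
        byte[] value;
    }

    interface Store {
        List<Version> versionsOf(String key);   // all versions, newest first
    }

    class SiTransaction {
        final long txId;
        final long startTimestamp;
        final Store store;

        SiTransaction(long txId, long startTimestamp, Store store) {
            this.txId = txId;
            this.startTimestamp = startTimestamp;
            this.store = store;
        }

        /** Read rule: our own write, else the latest version committed before our start. */
        byte[] read(String key) {
            for (Version v : store.versionsOf(key)) {
                if (v.writerTxId == txId) {
                    return v.value;              // version written by t itself
                }
                if (v.commitTimestamp != 0 && v.commitTimestamp < startTimestamp) {
                    return v.value;              // latest committed-before-start version
                }
            }
            return null;                         // no visible version exists
        }

        /** "First-Updater-Wins" check: abort if a competing write to key exists. */
        boolean versionCheckPasses(String key) {
            for (Version v : store.versionsOf(key)) {
                if (v.writerTxId == txId) {
                    continue;                    // our own writes do not conflict
                }
                boolean committedAfterStart =
                    v.commitTimestamp != 0 && v.commitTimestamp > startTimestamp;
                boolean stillActive = (v.commitTimestamp == 0);
                if (committedAfterStart || stillActive) {
                    return false;                // t has to abort
                }
            }
            return true;
        }
    }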
3.2 Data Storage Design Issues

We have implemented our protocol in HBase. HBase is an open-source distributed data storage written in Java, modelled after Google Bigtable. In [3] Bigtable is presented along with the design and implementation decisions made by its developers.

The authors called Bigtable "a sparse distributed persistent multidimensional sorted map". The Bigtable API is very simple, and the HBase API is likewise restricted to get, put, delete and scan operations. The data processing necessary for a query cannot be pushed close to the data, as is usually done in distributed RDBMSs. If a client wants to update data according to some rule, it requests the data, updates it locally, and puts the updated data back to the storage; that is, complex query execution is decomposed into two or more phases. As mentioned before, initially HBase provided only eventual consistency guarantees, and HBase/Bigtable users who needed transactional execution were advised to implement it themselves using data timestamps.

An HBase cluster consists of servers of two types: master nodes and region nodes. Every region server manages a number of regions. HBase uses horizontal partitioning; each region contains the data for an interval of keys.

HBase is a column-oriented storage, that is, data is stored on a per-column basis in main memory and in the file system. Every column store has its own set of files and a per-file buffer pool employing the classical LRU discipline to manage pages. HBase uses immutable files to store data: when the total space occupied by the updates kept in main memory exceeds a specified threshold, a new file is written to the file system in a special thread. HBase uses a write-ahead log (WAL), that is, every update is written to the log before it gets to the main-memory store.
3.3 Protocol Performance Discussion

The "First-Committer-Wins" feature has an optimistic nature, so the SI definition naturally admits an optimistic protocol. Our approach is not optimistic; it should rather be called "First-Updater-Wins".

Although the algorithm is non-locking, it is closer to restart-oriented locking methods such as immediate-restart. A description of the immediate-restart algorithm, its performance modelling, and a comparison with blocking and optimistic methods can be found in [1]. More sophisticated restart-oriented methods exist, such as wound-wait or wait-die, which can outperform immediate-restart in most cases [11]; we did not attempt to build a protocol on top of these ideas.

Before a data object is written, the version check is performed: all versions of the data object made after the transaction started must be selected and traversed. This may take a considerable amount of time, because some committed versions of the object may already have been written to the file system, in which case a number of page reads has to be performed.

Some tricks could be used to avoid unnecessary disk accesses, or at least to reduce their probability. First, the main-memory store could be assigned a timestamp each time its contents are flushed to a new file; a version check performed in main memory alone is then sufficient to ensure the "First-Updater-Wins" feature whenever the transaction started after the main-memory store was last written to disk. Second, only the oldest versions could be written to file, keeping the newer ones in memory: given an average transaction execution time tavg, we can keep the versions with timestamps greater than now() − tavg. A minimal sketch of the first trick is given below.
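In this sketch the main-memory store is stamped with its last flush time, and a disk access is needed only for transactions that started before that stamp. The names (MemStore, needsDiskCheck) are hypothetical and do not come from the prototype.

    // Illustrative sketch of the flush-timestamp shortcut; names are hypothetical.
    class MemStore {
        private long lastFlushTimestamp;

        // Stamp the store each time its contents are flushed to a new file.
        void onFlush(long now) {
            lastFlushTimestamp = now;
        }

        /**
         * If the transaction started after the last flush, every version it
         * can conflict with is still in main memory, and the version check
         * can be answered without touching the disk.
         */
        boolean needsDiskCheck(long txStartTimestamp) {
            return txStartTimestamp <= lastFlushTimestamp;
        }
    }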
To ensure transactional execution of queries in a distributed data storage, we use a distributed commit protocol. We have chosen a protocol similar to Two-Phase Commit (2PC) for its implementation simplicity. The classical 2PC successful-commit scenario consists of the following steps: the transaction initiator waits for a READY message from all the participating servers; when all the participants have voted to commit, it sends them a COMMIT message and waits for ACK messages; after all ACK messages have been received, the initiator finishes the transaction.

Since HBase uses a WAL, when a subtransaction sends its READY message to the transaction home node (the client machine), it has no additional work to do before it can commit, and it cannot be aborted by another local subtransaction. Therefore, when the transaction initiator has received READY messages from all the executing region servers, it just sends them the COMMIT message and commits the transaction locally; it does not wait for ACK messages from the executor servers as in classical 2PC.
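The initiator side of this 2PC variant can be sketched as follows. The messaging interface is a placeholder of ours; the point is only that the initiator finishes right after sending COMMIT, without collecting ACKs.

    import java.util.List;

    // Placeholder messaging abstraction for the commit-protocol sketch.
    interface Participant {
        boolean awaitVote();        // blocks until READY (true) or ABORT (false)
        void send(String message);  // "COMMIT" or "ABORT"
    }

    class CommitInitiator {
        /**
         * 2PC variant used by the protocol: wait for READY from every
         * participant, then send the decision and finish locally without
         * waiting for ACK messages, unlike classical 2PC.
         */
        boolean commit(List<Participant> participants) {
            boolean allReady = true;
            for (Participant p : participants) {
                if (!p.awaitVote()) {   // any ABORT vote dooms the transaction
                    allReady = false;
                    break;
                }
            }
            String decision = allReady ? "COMMIT" : "ABORT";
            for (Participant p : participants) {
                p.send(decision);       // fire and forget: no ACK expected
            }
            return allReady;            // the transaction finishes immediately
        }
    }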
Let us summarize the overheads introduced by the protocol:

• Employing a distributed commit protocol leads to additional communication overhead.

• The version check may lead to additional (local) disk accesses and may consume CPU.

• Certain resources are wasted due to the additional aborts caused by the protocol.
4 Simulation Performance Model

The simulation model was written in Java using the Simkit simulation package [9].

Our model is hardware-resource bounded rather than data-contention bounded; we believe that HBase was not intended for workloads with high data contention. The model was not aimed at solving fundamental questions; it concentrates on answering engineering questions only. We have modelled the performance of a particular distributed data storage, so the model is much less universal than the models employed in the related works mentioned above. The model is also rather simple and abstracts from some details; for example, the primary-copy replication used in HBase may affect performance, but it is not reflected in the model.

The model will be described in terms of queries; we will use the term transaction when we need to outline the differences.
4.1 Modelling Assumptions

We use a closed system model: there is always the same number of queries in the system, and after one query has completed its execution, a new one immediately replaces it. Since concurrency control performance depends strongly on the active query count in the system, we decided to keep this number constant.

The cluster is modelled as a queueing network, and each node is itself modelled as a queueing network.

We use a quite common assumption, called the "homogeneity assumption" in [6]:

• All database sites have the same structure and the same service capacity.

• The amounts of data allocated to each unit are equal.

• Data accesses are spread among the nodes in a uniform manner.

We do not use a sophisticated model of the communication network either, because we believe that the network should not be a bottleneck. In our model there is a virtual connection with 100 Mbit/sec throughput between every query initiator (client) and executor node.

Table 1 contains the system settings used for the experiments.

    Parameters                              Settings
    CPU: cores per PE                       2
         processor capacity (MIPS)          9000
    #avg instr.: BOE and EOE steps          5000
         per object reference               25000
         for message                        5000
         for IO operation                   3000
    Disks: count per PE                     1
         access time mean (ms)              8
         page transmission time (ms)        0.2
    Network: throughput (Mbit/sec)          100
         min packet size                    540
    Database: page size (Kb)                4
         count of pages per PE              53000000
         count of pages in buffer pool      500000

Table 1: System parameter settings.
4.2 Workload Model

The workload model is homogeneous, i.e. it consists of queries of one type. As mentioned above, a complex query is executed in a number of steps. We evaluate the performance for a query which has two execution steps:

1. Several regions are scanned. After the scan, the results are sent to the client (the query initiator machine).

2. Some of the objects obtained in the first step are updated. The updates take place in the same regions, and thus are performed by the same region servers.

The number of nodes participating in query execution follows a Poisson distribution with mean M. The number of data objects a query reads per region server is exponentially distributed with mean Nr; the write count is distributed in a similar way with mean Nw. Reads and writes are distributed among the nodes uniformly.

The 99%-1% data access pattern is used: 99% of the transactions reference 1% of the data objects. In our model the data access pattern influences both the buffer pool hit rate and the data contention.

For simplicity, we assumed that a data object version has the size of a page. This is a reasonable assumption, because HBase was not designed to work with objects of small size. Thus, the granule of the CC algorithm is represented by a page in our model.

Workload parameters can be found in Table 2.

    Parameters                  Settings
    M                           7
    Nr                          1000
    Nw                          10
    Nvc                         10
    data access pattern         99%-1%
    percent of sequential IO    5%

Table 2: Workload parameter settings.
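The workload generator can be summarized by the sketch below, which draws the per-query parameters from the distributions described above (Poisson node count, exponential read and write counts, 99%-1% object choice). The class is our own illustration and does not come from the simulator.

    import java.util.Random;

    // Illustrative sampler for the workload parameters of Table 2.
    class WorkloadSampler {
        static final double M  = 7;     // mean participating node count (Poisson)
        static final double NR = 1000;  // mean reads per region server (exponential)
        static final double NW = 10;    // mean writes per region server (exponential)
        final Random rnd = new Random();

        /** Poisson-distributed node count (Knuth's algorithm). */
        int sampleNodeCount() {
            double limit = Math.exp(-M), p = 1.0;
            int k = 0;
            do {
                k++;
                p *= rnd.nextDouble();
            } while (p > limit);
            return k - 1;
        }

        /** Exponentially distributed counts via the inverse CDF. */
        int sampleReads()  { return (int) Math.round(-NR * Math.log(1 - rnd.nextDouble())); }
        int sampleWrites() { return (int) Math.round(-NW * Math.log(1 - rnd.nextDouble())); }

        /** 99%-1% pattern: 99% of queries reference only the "hot" 1% of objects. */
        boolean isHotClass() {
            return rnd.nextDouble() < 0.99;
        }

        long sampleObjectId(long d, boolean hotClass) {
            long hotSize = Math.max(1, d / 100);   // the hot set is 1% of all D objects
            return hotClass
                ? (long) (rnd.nextDouble() * hotSize)
                : hotSize + (long) (rnd.nextDouble() * (d - hotSize));
        }
    }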
4.3 Query Execution Model

Each query has a home node: the client machine at which it arrived. After a query has arrived, it is split into a number of subqueries, each of which is local to one of the execution nodes. Each subquery is modelled as a list of data references (pages) to be processed. For each subquery a message is sent to the execution site to start the execution; the query joins the Net queue and consumes CPU time to send a message to each execution site.

When the subquery start message arrives at the execution site, the subquery is created; it joins the Net queue on the execution site and consumes the CPU time needed to read the message. Then it consumes CPU to begin execution.

Subquery execution consists of a Begin-Of-Execution (BOE) step, a number of data reference processing iterations, and an End-Of-Execution (EOE) step. For each data object reference, the subquery goes to the buffer pool to get the data. If the required page is not already in memory, the subquery consumes processor time to start the IO and joins the disk queue. Then the subquery goes to the CPU queue and consumes time to process the data reference (to find the appropriate version). If there are more references, the subquery repeats the data reference processing iteration.

After a subquery has successfully finished its execution, it sends the requested data (if present) and the READY message to the home node. After that the subquery is finished, so it goes to the CPU queue and consumes time for the EOE step.

The client receives the data from all the participating servers, processes it, and sends a message to each participant to perform the updates of the second execution stage.

A write subquery is executed in a similar way to a read subquery; its behaviour differs in the data processing iteration. The write subquery first consumes CPU for reference processing and after that writes the log page to the file system and enqueues the page write (it does not wait until the page is actually written).
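The per-reference processing loop of a subquery can be sketched as follows; the queueing stations are placeholders of ours, and the instruction counts come from Table 1.

    import java.util.List;

    // Sketch of the data-reference loop; Station, Disk and BufferPool are
    // placeholders for the model's queueing stations.
    class SubqueryExecution {
        static final long INSTR_PER_REFERENCE = 25000;  // Table 1: per object reference
        static final long INSTR_PER_IO_START  = 3000;   // Table 1: for IO operation

        interface Station    { void consume(long instructions); }
        interface Disk       { void readPage(long pageId); }      // joins the disk queue
        interface BufferPool { boolean contains(long pageId); }

        void processReferences(List<Long> pages, BufferPool pool, Station cpu, Disk disk) {
            for (long page : pages) {
                if (!pool.contains(page)) {
                    cpu.consume(INSTR_PER_IO_START);  // CPU time to start the IO
                    disk.readPage(page);              // wait in the disk queue
                }
                // CPU time to process the reference (find the appropriate version).
                cpu.consume(INSTR_PER_REFERENCE);
            }
        }
    }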
4.4 Transaction Execution Model

Read subtransactions do not differ from read subqueries in their execution model: although they may consume more CPU time per data reference to find the appropriate version, the performed steps are the same.

A write subtransaction performs the version check before each write. If the version check fails, the transaction is to be aborted, and the subtransaction goes to the Net node to send an ABORT message. The probability of version check failure is discussed below.

The probability that the versions needed for the version check have not been flushed to a file since the transaction started is quite high for a workload that is not write-intensive. Therefore, the version check step is modelled as the processing of Nvc pages, each of which is already in the buffer pool, so no disk IO is needed.

After the write subtransaction has finished execution, it sends READY to the home node and waits for a COMMIT or ABORT message in reply. The transaction initiator waits until all READY messages (or one ABORT message) have been received; then it sends the COMMIT (ABORT) message to all executors and the transaction is finished. After the subquery receives the COMMIT (ABORT) message, it writes a log page and goes to the CPU queue to perform the EOE step.

4.5 Abort Probability

Let us consider the probability of version check failure. Suppose the transaction t, which started at the moment u0, is going to update the value of data object x at the moment u.

The failure of the version check may be caused by two types of transactions that have updated x recently:

1. x may have been written by a transaction that committed after t started. Given the throughput of the server T(u) at the moment u, the number of updates made by transactions committed in the time interval [u0, u] can be approximated by Nw ∗ T(u) ∗ (u − u0).

2. x may have been written by a transaction that has not yet committed. Let A(u) denote the number of write subtransactions active at the server at the moment u. The average number of writes made by an active subtransaction may be approximated by Nw/2, so the number of writes made by all active transactions at the moment u may be estimated as Nw ∗ A(u)/2.

The count of updates written by transactions whose execution intervals overlap with the execution interval of t may therefore be approximated by

    K(u, u0) = Nw ∗ (A(u) + 2 ∗ T(u) ∗ (u − u0)) / 2

Let us consider the a-b data access pattern and the case when t belongs to the majority class of transactions. The number of updates made by transactions of the major class is a ∗ K(u, u0), while the number of data objects t can access is D ∗ b. The probability of observing an update made by a transaction of the same class is

    a ∗ K(u, u0) / (D ∗ b)

The probability of observing updates made by transactions of the other class is

    (1 − a) ∗ K(u, u0) / D

Thus, for a transaction of the first type the probability of version check failure can be estimated by

    P1(u, u0) = ((b ∗ (1 − a) + a) / b) ∗ K(u, u0) / D

and for a transaction of the second type by

    P2(u, u0) = K(u, u0) / D
The probability that versions needed for version check and for a transaction of the second type by
were not flushed to a file since the transaction had started
is quite high for not write intensive workload. Therefore, K(u, u0 )
P2 (u, u0 ) =
the version check step is modeled as processing of Nvc D
5 Simulation Results
The main goal of our performance study is to evaluate the performance degradation caused by the additional overheads introduced by the protocol. We measured the performance of the protocol in terms of system throughput and average response time.

For a particular parameter setting, we evaluated the throughput degradation as the difference between the EC and SI throughputs divided by the EC throughput: degradation = (throughput_EC − throughput_SI) / throughput_EC.

In this section we use the abbreviation EC to denote the HBase version which uses eventual consistency, and SI for the HBase version which uses our protocol. For most of the experiments we present two diagrams: one depicts the normal workload results, while the other reflects the situation when all the available resources are already saturated. We have paid special attention to the saturated system performance in order to examine the influence of transaction aborts on the throughput.
Fig. 1 and 2 plot the throughput results against the active query count in the system. The maximum degradation obtained was 3.3% for the normal workload and 7.7% for the saturated system. In the latter case transaction aborts make a significant contribution to the throughput degradation. As the number of queries in the system increases, the throughput degradation rises to the mentioned maximum of 7.7% at 1400 queries. When the number of queries is equal to 100, the number of aborts per second is less than 30% of the difference between the EC and SI throughputs. The ratio of the abort rate to the difference of the throughputs reaches its maximum of 98.8% at the point of 700 queries.

Figure 1: Normal workload throughput against the active query count.

Figure 2: Saturated workload throughput against the active query count.

Fig. 3 and 4 give the average response time against the active query count in the system. The relative response time increase is quite small and does not exceed 3%. After the query count reaches 35, the relative response time increase does not change significantly; it remains about 3% for the highly saturated workload too.

Figure 3: Normal workload response time against the active query count.

Figure 4: Saturated workload response time against the active query count.

As expected, the use of the distributed commit protocol does not make a considerable contribution to the average response time for long-running transactions. The disk system appears to be the bottleneck for this kind of workload, and the performance of the system is substantially influenced by the buffer pool hit rate.
Fig. 5 and 6 show the throughput results for EC and SI against the buffer pool hit rate. These experiments were conducted with the systems running 70 (Fig. 5) and 400 (Fig. 6) transactions. The partition size was varied to obtain the needed buffer pool hit rate. The results of both experiments are similar: in both, the systems show a significant speedup as the buffer pool hit rate verges towards 1. EC outperforms SI by 0.51-6.3% in the case of the normal workload and by 2.9-6.7% otherwise.

Figure 5: Normal workload throughput against buffer pool hit rate.

Figure 6: Saturated workload throughput against buffer pool hit rate.

Fig. 7 gives the response time against the buffer pool hit rate for the system running 70 queries. The response time is not as sensitive to the buffer pool hit rate as the throughput; the graph is close to linear. However, the relative response time increase grows rapidly as the buffer pool hit rate gets closer to 1, reaching a maximum of 6.25% for the normal workload and 3.87% for the saturated workload.

Figure 7: Normal workload response time against buffer pool hit rate.
6 Model Validation

We have implemented the protocol in HBase, and initially we evaluated the protocol performance using the prototype. We measured the prototype performance for a workload consisting of short-running queries, each of which read and updated 3 values of double type. The prototype showed relatively good results for this workload: its throughput was 3 times lower than that of the original HBase version.

To validate our model, the test set mentioned above was changed. If each query writes 3 double values, the size of the log entries to be flushed to disk before the query finishes is small, so we replaced the double values with 4Kb byte arrays. After this change we discovered that the HBase throughput was ten times greater than that of the prototype.

We suppose the prototype implementation showed such poor performance because it copies data objects for internal purposes: for every single get and put operation at least one data object is copied. Despite the fact that main memory reads and writes are extremely fast, the copying of data objects may affect the performance when there are no disk accesses.
The values of the model parameters for EC were selected in such a way that the EC throughput was as close as possible to the HBase throughput. The experiment conducted with the same parameters using the SI model showed a throughput 1.5 times lower than for EC (1200 vs. 1800). To bring the SI model performance close to that of the prototype, we adjusted the mean page processing time parameter to 270000 instructions instead of the default 25000.

The primary goal of this experiment was to validate the abort probability model. Fig. 8 shows the prototype throughput and the SI model throughput, and Fig. 9 shows the abort rates obtained. The difference between the abort rates obtained from the model and from the prototype did not exceed 3%, being about 2% on average. We conclude that our model captures the system behaviour adequately.

Figure 8: Model validation: SI throughput.

Figure 9: Model validation: SI abort rate.
7 Summary
In this work we presented an SI protocol and evaluated its performance. The advantage of the protocol over EC is obvious: the data storage user is provided with SI consistency guarantees. However, the protocol introduces several additional overheads which may affect performance. We have modelled the performance for a particular kind of workload and validated the model using the prototype presented in [12]. Although the prototype implementation has shown quite poor performance, this does not imply that the protocol itself is ineffective: the simulation results have shown that the use of the protocol does not lead to significant performance degradation for this workload type.
References
[1] Rakesh Agrawal, Michael J. Carey, and Miron
Livny. Concurrency control performance model-
ing: Alternatives and implications. ACM Transac-
tions on Database Systems, 12(4):609–654, 1987.
[2] Hal Berenson, Phil Bernstein, Jim Gray, Jim Melton, Elizabeth O'Neil, and Patrick O'Neil. A critique of ANSI SQL isolation levels. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, volume 24, pages 1-10, New York, 1995.
[3] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wil-
son C. Hsieh, Deborah A. Wallach, Mike Burrows,
Tushar Chandra, Andrew Fikes, and Robert E. Gru-
ber. Bigtable: A distributed storage system for
structured data. In Proceedings of the 7th USENIX
Symposium on Operating Systems Design and Im-
plementation, volume 7, pages 295–310, 2006.
[4] Khuzaima Daudjee and Kenneth Salem. Lazy
database replication with Snapshot Isolation. In
Proceedings of the 32nd international conference
on Very Large Databases, 2006.
[5] Robert Marek and Erhard Rahm. Performance eval-
uation of parallel transaction processing in Shared
Nothing database systems. In Proceedings of the
4th International PARLE Conference on Parallel
Architectures and Languages Europe, pages 295–
310, Paris, 1992.
[6] Matthias Nicola and Matthias Jarke. Performance
modeling of distributed and replicated databases.
IEEE Transactions on Knowledge and Data Engi-
neering, 12(4), 2000.
[7] Erhard Rahm. Parallel query processing in Shared Disk database systems. In Proceedings of the 5th Int. Workshop on High Performance Transaction Systems, Asilomar, 1993.
[8] Pierangelo Di Sanzo, Bruno Ciciani, and Francesco Quaglia. A performance model of multi-version concurrency control. In Proceedings of the IEEE International Symposium on Modeling, Analysis and Simulation of Computers and Telecommunication Systems, pages 1-10, 2008.
[9] Simkit. Naval Postgraduate School, March 2010.
[10] Y.C. Tay. Issues in modelling locking performance. In Hideaki Takagi, editor, Stochastic Analysis of Computer and Communication Systems. Elsevier Science Publishers B.V. (North-Holland), New York, 1990.
[11] Alexander Thomasian. Concurrency control:
Methods, performance, and analysis. ACM Com-
puting Surveys, 30(1), 1998.
[12] Dmitri Vasilik. Implementing Snapshot Isolation
in HBase. Diploma thesis, Saint Petersburg State
University, 2010.
[13] Hoang Tam Vo, Chun Chen, and Beng Chin Ooi.
Towards elastic transactional cloud storage with
range query support. Proceedings of the VLDB En-
dowment, 3(1-2), 2010.
[14] Chen Zhang and Hans De Sterck. Supporting multi-
row distributed transactions with Global Snapshot
Isolation using bare-bones HBase. In Proceedings
of the 11th ACM/IEEE International Conference
on Grid Computing (Grid 2010), pages 295–310,
Brussels, 2010.