                              Decoupling Persistence Services
                              from DBMS Buffer Management

            Lucas Lersch∗                 Caetano Sauer                  Theo Härder
            TU Kaiserslautern             TU Kaiserslautern              TU Kaiserslautern
            Germany                       Germany                        Germany
            lucas.lersch@sap.com          csauer@cs.uni-kl.de            haerder@cs.uni-kl.de

∗Currently with TU Dresden & SAP SE.

28th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 24.05.2016–27.05.2016, Nörten-Hardenberg, Germany. Copyright is held by the author/owner(s).

ABSTRACT
Advances in DRAM technology and, in turn, substantial cost reduction for volatile memory in recent years require an evolution of database system architectures to take full benefit of large buffer pools. With huge memory sizes, an up-to-date version of database pages on stable storage is more necessary than ever to support fast and effective crash recovery.
   In this contribution, we consider important components of a traditional DBMS architecture and related opportunities for optimization in the context of persistence and system recovery. We implemented and evaluated novel checkpointing and page cleaning algorithms which are based on log information rather than on data collected from critical in-memory data structures such as the buffer pool and transaction manager. Decoupling such persistence-related components from in-memory processing enables a simpler and more modular DBMS architecture as well as less interference of these components with critical in-memory data structures.

Keywords
database architecture, transaction processing, recovery
1.   INTRODUCTION
   The architecture of traditional database systems evolved in a time when the available amount of main memory was relatively small compared to the size of typical application working sets. To achieve satisfactory transaction performance, these systems had to be heavily optimized for the underlying hardware, i.e., the design had to consider frequent operations on persistent storage, taking into account high latencies of both commit operations and data accesses. Due to current advances in hardware technology, however, an adaptation of existing DBMS architectures is required to unleash their inherent performance potential [11].
   Compared to the situation in the 1970s and 1980s, one of the most important hardware-related changes is the dramatic cost reduction of volatile memory. Nowadays, it is realistic to consider a scenario where the buffer pool can accommodate most, if not all, data pages and, consequently, there are fewer page reads and writes from/to persistent storage. Furthermore, the commit latency is drastically reduced thanks to modern solid-state drives, which offer a much higher number of I/O operations per second than hard-disk drives. In this contribution, we reconsider the existing components of a traditional database architecture and opportunities for optimization in the context of persistence and system recovery.
   We assume that, even with large amounts of memory and transaction processing hardly requiring I/O operations, it is still desirable to keep an up-to-date version of database pages in persistent storage to guarantee efficient recovery in case of a system failure. Most modern database systems rely on write-ahead logging techniques and implement the ARIES [6] algorithm for system recovery. In such environments, regular checkpoints are taken to improve recovery performance. Checkpoints serve as the starting point for log analysis, the first phase of crash recovery. The more recently a checkpoint was taken, the less time log analysis takes to complete in case of a failure.
   In addition to checkpoints, it is also common for database systems to periodically clean dirty pages (pages with committed updates not yet propagated) by flushing them from the in-memory buffer pool to persistent storage. Since the REDO phase (after a crash) would require random reads, its cost is reduced by maintaining only a low count of dirty pages [7]. In this work, we refer to both checkpointing and page cleaning generically as propagation services, in the sense that they propagate information from main memory to persistent storage.
   The problem is that such propagation services interfere with and might disturb in-memory transaction processing, since they must inspect data structures and consequently acquire and release latches on them. This is especially true for scenarios of large buffer pools capable of holding the entire application working set, since disk I/O is less likely to be the bottleneck. Furthermore, decoupling unnecessarily coupled components in a system is not only a matter of performance; it also yields a cleaner, more modular system architecture and reduces code complexity.
   The first contribution of the proposed architecture is to enable checkpoints to gather all required information solely by inspecting the log. As a main advantage, this is not only simpler, since it reuses the existing logic of log analysis during recovery, but it also implies less interference with in-memory processing, as mentioned above. The second contribution is a decoupled implementation of a page cleaner, which also makes use of log data instead of in-memory information. Besides further reducing the interference in the buffer pool, it can be combined with single-page recovery techniques [3] that enable a set of interesting features for the recovery component of a database system, as discussed later. Figure 1 illustrates the differences between the traditional design and the proposed decoupled design.
[Figure 1: Differences between the traditional and the proposed architecture. (a) Tightly coupled persistence; (b) Decoupled persistence.]


2.    BACKGROUND

2.1     System Recovery

2.1.1    ARIES Restart
   To offer transaction atomicity and durability guarantees, i.e., the "A" and "D" of ACID, most modern database systems implement a write-ahead logging (WAL) mechanism. The ARIES algorithm [6] embodies an efficient way to enable WAL-based system recovery. In case of a system crash, the recovery goal is to re-establish the most recent transaction-consistent database state.
   The system recovery algorithm operates in three different phases: log analysis, REDO, and UNDO. Starting from the last checkpoint, log analysis scans the log and inspects all log records to determine which pages were dirty and which transactions were active at the time of failure. After the log has been analyzed, the REDO phase scans the relevant part of the log and replays all log records that refer to updates of dirty pages. Finally, the UNDO phase rolls back the transactions that were active at the time of failure.
   As observed in a previous analysis of typical OLTP workloads [7], UNDO is usually very short, and the determinant factor for recovery cost is the number of dirty pages that require log replay in the REDO phase. Therefore, it is crucial to clean pages proactively as frequently and efficiently as possible. Furthermore, it is desirable to take frequent checkpoints to reduce the time required for log analysis. In a traditional design with tightly-coupled propagation services, however, an increased frequency of page cleaning and checkpoints inevitably hurts performance, as shown in Section 4.
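   To make the control flow of the three phases concrete, the following C++ sketch outlines a restart driver. It is a minimal illustration only: the log interface (scan_log, last_checkpoint_lsn) and the redo/rollback routines are hypothetical stubs, and it is deliberately simplified (e.g., the dirty-page table recorded in the checkpoint itself is ignored, and the REDO filter is coarser than in a real ARIES implementation).

#include <algorithm>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using lsn_t = uint64_t;
using page_id_t = uint32_t;
using tx_id_t = uint32_t;

// Hypothetical log record layout; only what this sketch needs.
struct LogRecord {
    lsn_t lsn;       // address of this record in the log
    tx_id_t tx;      // transaction that generated it
    page_id_t page;  // page affected by the update
    bool is_commit;  // true for commit records
};

std::vector<LogRecord> g_log;  // stand-in for the persistent log

// Stubs standing in for the log manager and the REDO/UNDO machinery.
std::vector<LogRecord> scan_log(lsn_t start) {
    std::vector<LogRecord> out;
    for (const auto& lr : g_log)
        if (lr.lsn >= start) out.push_back(lr);
    return out;
}
lsn_t last_checkpoint_lsn() { return 0; }
void redo(const LogRecord&) { /* replay one update */ }
void rollback(tx_id_t) { /* undo a loser transaction */ }

// The three restart phases: log analysis, REDO, UNDO.
void restart() {
    // 1) Analysis: which pages were dirty, which transactions active?
    std::map<page_id_t, lsn_t> dirty;  // page -> earliest dirtying LSN
    std::set<tx_id_t> losers;
    for (const LogRecord& lr : scan_log(last_checkpoint_lsn())) {
        if (lr.is_commit) { losers.erase(lr.tx); continue; }
        losers.insert(lr.tx);
        dirty.emplace(lr.page, lr.lsn);  // emplace keeps the earliest LSN
    }
    // 2) REDO: replay all updates that refer to pages found dirty.
    lsn_t redo_start = last_checkpoint_lsn();
    for (const auto& kv : dirty) redo_start = std::min(redo_start, kv.second);
    for (const LogRecord& lr : scan_log(redo_start))
        if (!lr.is_commit && dirty.count(lr.page)) redo(lr);
    // 3) UNDO: roll back all loser transactions.
    for (tx_id_t tx : losers) rollback(tx);
}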
2.1.2    Instant Restart
   An alternative design for system recovery is the instant restart mechanism [2], which opens the system for new transactions immediately after the log analysis phase, performing the required REDO and UNDO actions on demand. The design is implemented with moderate incremental changes to the traditional ARIES algorithm. On-demand REDO of dirty pages is performed page by page instead of using a log scan. To that end, each log record must include a pointer (i.e., the LSN) to the last log record affecting a specific page. Such a pointer is simply copied from the page LSN field prior to generating a log record.
   For on-demand UNDO, loser transactions are rolled back concurrently with REDO and newly running transactions. This rollback is triggered by lock conflicts between new and loser transactions. Therefore, log analysis requires a lock re-acquisition for loser transactions, which implies that currently-held locks must also be included in the information collected for checkpoints – this is in contrast to the ARIES method, which re-acquires such locks in the REDO phase. For a traditional checkpoint procedure inspecting in-memory data structures, this effort causes additional interference, which is eliminated in our decoupled design.
   For further details, we refer to an extensive publication on instant recovery techniques [2].
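   The per-page chain that makes page-by-page REDO possible can be sketched as follows. The struct layout and the append_to_log stub are hypothetical, but the copying of the page LSN into the new log record follows the description above.

#include <cstdint>

using lsn_t = uint64_t;

// Hypothetical page header: every page carries the LSN of the last
// log record that modified it (the ARIES page LSN).
struct Page {
    lsn_t page_lsn = 0;
    // ... payload omitted ...
};

// Log record header extended for on-demand REDO: page_prev_lsn links
// all records of one page into a backward chain through the log.
struct LogRecordHeader {
    lsn_t lsn;            // address of this record
    lsn_t page_prev_lsn;  // previous record affecting the same page
};

lsn_t g_next_lsn = 1;                                // stub log manager
lsn_t append_to_log(const LogRecordHeader&) { return g_next_lsn++; }

// The pointer is simply copied from the page LSN field before the
// page LSN is advanced to the newly generated record.
lsn_t log_update(Page& p) {
    LogRecordHeader h;
    h.page_prev_lsn = p.page_lsn;  // old page LSN enters the chain
    h.lsn = append_to_log(h);
    p.page_lsn = h.lsn;            // page now points to the newest record
    return h.lsn;
}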




2.2    Checkpoints
   In a general way, a checkpoint can be defined as a relatively recent persistent copy of any information that serves as a basis for restart and enables a faster recovery process. The more recently a checkpoint was taken w.r.t. the time of the system failure, the less work is required for recovering the system. Thus, it is good practice to take checkpoints regularly. To be more precise in the definition of a checkpoint, we follow the definition of fuzzy checkpoints as presented in [4]. Accordingly, a checkpoint comprises relevant server state information such as dirty pages, active transactions, and – for the support of instant restart – all acquired locks. The checkpoint is not concerned with the database state and, therefore, it does not write any pages from the buffer pool – as discussed later, this concern is delegated to the page cleaner service.
   To take a checkpoint, a BEGIN CHKPT log record is written to the log. Then, all the required information is gathered from the in-memory data structures and written to the log as so-called checkpoint log records. An END CHKPT log record is inserted to indicate that the checkpoint completed correctly. The information comprised by a checkpoint summarizes, for the purposes of restoring the server state, all the log records up to the point where the checkpoint began.
   After a crash and during system restart, the log analysis phase retrieves the most recent checkpoint and starts a forward scan from there up to the end of the log. During the scan, all log records are analyzed to update the information collected by the last completed checkpoint. As a result, log analysis delivers the relevant restart information for the state exactly before the crash occurred.
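   The write protocol of such a fuzzy checkpoint can be sketched as follows. The record encoding and the three state accessors are hypothetical placeholders for the buffer pool, the transaction manager, and the lock manager of a real system.

#include <string>
#include <vector>

// Stub log: records are appended as opaque strings here.
std::vector<std::string> g_log;
void log_append(const std::string& rec) { g_log.push_back(rec); }

// Stubs for the in-memory state a traditional fuzzy checkpoint copies:
// the buffer pool's dirty pages, the active transactions, and (for
// instant restart) the acquired locks.
std::vector<std::string> dirty_page_entries() { return {}; }
std::vector<std::string> active_tx_entries()  { return {}; }
std::vector<std::string> lock_entries()       { return {}; }

// Fuzzy checkpoint protocol as described above: state is logged
// between BEGIN_CHKPT and END_CHKPT without quiescing transactions.
void take_checkpoint() {
    log_append("BEGIN_CHKPT");
    for (const auto& e : dirty_page_entries()) log_append("DIRTY " + e);
    for (const auto& e : active_tx_entries())  log_append("TX " + e);
    for (const auto& e : lock_entries())       log_append("LOCK " + e);
    log_append("END_CHKPT");  // marks the checkpoint as complete
}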
2.3    Page Cleaning
   The REDO phase is responsible for replaying all log records that refer to pages marked as dirty in the buffer pool at the moment of system failure. Therefore, the amount of work to be processed by REDO is directly related to two aspects: (1) the number of dirty pages in the buffer pool at failure time and (2) how old the versions of their persistent page copies are (the older a copy is, the more log records are expected to be replayed to bring the page to its most recent state).
   During normal processing, there are three situations in which pages are flushed from the buffer pool to persistent storage. First, if a dirty page is picked for eviction, it is flushed to persistent storage to make room for fetching the new page. In systems with a large buffer pool, the eviction of a dirty page tends to happen less frequently; thus, regularly flushing dirty pages at appropriate points is beneficial for reducing the REDO time in case of a crash. This corresponds to the second situation. Third, at normal system shutdown, it is desirable to flush all dirty pages to avoid any recovery at the next start-up.
   Most modern database systems implement a page cleaner service running as a background thread to handle all these situations. In other words, the task of flushing a page is always delegated to the page cleaner service.
   Once it is activated, the page cleaner iterates over the buffer pool and inspects the frame descriptor blocks to determine whether the contained pages should be cleaned, i.e., flushed to persistent storage. Pages can be selected according to a number of policies, e.g., hottest first or oldest first. In this work, we assume a naive policy that simply picks all dirty pages. Pages picked to be cleaned are then copied to a separate buffer; this is done to reduce the time for which a shared latch is held on the page – from a synchronous write call to a fast memcpy operation. The pages in the cleaning buffer are then sorted by their page identifier to enable sequential writes. After the write buffer is flushed, the cleaner must once again acquire shared latches on the pages just flushed and mark them as clean, provided that no further modifications were made to a page during the cleaning process. In total, each cleaned page is latched three times: to verify its dirty state, to copy it into an in-transit buffer, and to update the dirty state.
2.4    Log Archiving
   In WAL-based database systems, the latency of the log device has a direct impact on transaction throughput. Therefore, to enable better performance, latency-optimized stable storage (such as SSDs) is usually employed for the recovery log. However, compared to storage devices such as hard-disk drives, latency-optimized devices have a much lower capacity/cost ratio and are therefore less cost-effective. Furthermore, it is even more expensive to keep old log records on a latency-optimized device, given that most of these log records are unlikely to be needed. To avoid filling up the temporary log device, log records are continuously moved into a log archive on a cost- and capacity-optimized, long-term stable storage device. The latency of such a log archive device is not critical for system performance, since archived log records are usually needed only during recovery from a media failure.
   Instead of simply moving the log records from the temporary log to the log archive, it is possible to re-organize them in a more convenient way. To enable single-pass restore of a storage device after a media failure [8], the log records are sorted into runs ordered by page identifier when they are moved to the archive, using an external sort-merge operation. As a result, the log archive is said to be partially sorted, given that the primary sort criterion is an LSN range – which identifies a run – and the secondary sort criterion is the page identifier. These runs are merged asynchronously and incrementally, using a data structure similar to a partitioned B-tree [1].
   The partial sort order in the log archive also enables indexing, meaning that the history of a single page can be accessed much more efficiently. This was exploited in a proposal for instant restore [9], which enables on-demand restoration from a media failure concurrently with running transactions, similar to instant restart. However, such indexing can also be exploited for efficient REDO operations during restart after a system failure, as well as for single-page recovery in case of an isolated data corruption [3]. In our case, we exploit the partially sorted log archive for a novel application: propagation of updates with a decoupled page cleaner, which is discussed in the next section.
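   The partial sort order can be captured by a simple comparison function, sketched below with hypothetical field names: the run (an LSN range) is the primary criterion, the page identifier the secondary one, and the records of one page remain in LSN order.

#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

using lsn_t = uint64_t;
using page_id_t = uint32_t;

// Hypothetical archive entry: run number (derived from an LSN range)
// plus the page a log record refers to.
struct ArchiveEntry {
    uint32_t run;    // identifies the LSN range the record came from
    page_id_t page;  // page the log record refers to
    lsn_t lsn;       // position within the page's history
};

// Partial sort order of the log archive: primary criterion is the run,
// secondary criterion is the page identifier; within one page, records
// stay in LSN order.
bool archive_less(const ArchiveEntry& a, const ArchiveEntry& b) {
    return std::tie(a.run, a.page, a.lsn) < std::tie(b.run, b.page, b.lsn);
}

// Sorting one run before it is appended to the archive; merging runs
// asynchronously (cf. partitioned B-trees [1]) is not shown here.
void sort_run(std::vector<ArchiveEntry>& run) {
    std::sort(run.begin(), run.end(), archive_less);
}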




3.    DECOUPLED PERSISTENCE SERVICES

3.1    Decoupled Checkpoints
   The idea behind a decoupled checkpoint is to use the same logic as log analysis to gather all information needed for the checkpoint, instead of querying in-memory data structures for this purpose. To take a decoupled checkpoint, all log records between the last completed checkpoint and the BEGIN CHKPT log record referring to the checkpoint currently being taken are analyzed. The main motivation here is that, as minimal as it may be, any interference with the data structures caused by the process of checkpoint creation can be completely avoided. However, since it is desirable to re-use the same logic as log analysis, some important differences must be considered.
   First, taking a traditional checkpoint is completely independent from information contained in older checkpoints, since it only inspects in-memory data structures. However, in the case of a decoupled checkpoint, since the algorithm is the same as for log analysis, it must rely on information present in the previously completed checkpoint. This introduces the limitation that, when taking a new decoupled checkpoint, the previously completed checkpoint must always be contained in the log. Alternatively, the contents of such a checkpoint can be cached in main memory during normal processing.
   Second, since the decoupled checkpoint inspects only log records, page write operations must be logged to determine when a previously dirty page can be considered clean. As observed in previous studies [7], logging page writes is a common practice in existing database software, since it enables more effective recovery from a system failure.
   Third, taking a decoupled checkpoint may introduce additional I/O for reading the temporary log. An approach that eliminates this overhead maintains checkpoint information in main memory and continuously updates it by consuming log records from the log buffer, which also resides in main memory. When a checkpoint is requested, this information can then immediately be propagated to the persistent log. Additionally, the checkpoint generation process
can feed from the same input stream used for log archiving. When these two techniques are combined, no additional I/O overhead is incurred. For simplicity of implementation, our evaluation considers a naive approach where checkpoints always rely on reading the persistent log.
   Fourth, since the process of taking a decoupled checkpoint is basically the same as log analysis, the longer the elapsed time since the last checkpoint, the more time is required for taking a new decoupled checkpoint. Therefore, the new technique encourages more frequent checkpoints, which also benefits recovery performance. Since this kind of checkpoint creation does not interfere with in-memory processing, overall system performance should not be affected.
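   Putting the pieces together, a decoupled checkpoint is essentially log analysis over a bounded LSN interval. The following sketch illustrates this with hypothetical record kinds and a stubbed log scan; locks, which would also be collected for instant restart, are omitted for brevity.

#include <cstdint>
#include <map>
#include <set>
#include <vector>

using lsn_t = uint64_t;
using page_id_t = uint32_t;
using tx_id_t = uint32_t;

// PAGE_WRITE is the logged page flush that allows the scan to consider
// a previously dirty page clean again.
enum class Kind { UPDATE, COMMIT, PAGE_WRITE };
struct LogRecord { lsn_t lsn; Kind kind; tx_id_t tx; page_id_t page; };

std::vector<LogRecord> g_log;  // stand-in for the persistent log
std::vector<LogRecord> scan_log(lsn_t from, lsn_t to) {
    std::vector<LogRecord> out;
    for (const auto& lr : g_log)
        if (lr.lsn > from && lr.lsn <= to) out.push_back(lr);
    return out;
}

// State gathered by a checkpoint (acquired locks omitted for brevity).
struct CheckpointInfo {
    std::map<page_id_t, lsn_t> dirty_pages;  // page -> first dirtying LSN
    std::set<tx_id_t> active_txs;
};

// Decoupled checkpoint: run the log-analysis logic over all records
// between the previously completed checkpoint and the BEGIN_CHKPT of
// the one being taken -- no in-memory structure is latched.
CheckpointInfo decoupled_checkpoint(const CheckpointInfo& prev,
                                    lsn_t prev_begin, lsn_t new_begin) {
    CheckpointInfo info = prev;  // must be on the log or cached in memory
    for (const LogRecord& lr : scan_log(prev_begin, new_begin)) {
        switch (lr.kind) {
        case Kind::UPDATE:
            info.active_txs.insert(lr.tx);
            info.dirty_pages.emplace(lr.page, lr.lsn);
            break;
        case Kind::COMMIT:
            info.active_txs.erase(lr.tx);
            break;
        case Kind::PAGE_WRITE:  // flushed page is clean from here on
            info.dirty_pages.erase(lr.page);
            break;
        }
    }
    return info;  // subsequently written as checkpoint log records
}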

3.2    Decoupled Page Cleaner
   To motivate the use of a decoupled page cleaner, we first consider a scenario where the whole working set fits in main memory, later expanding our argument to medium and small buffer pools. For such a large buffer pool, there are no page misses and therefore no need for page eviction. Hence, the only I/O operations to the database device made by the cleaner service are those that periodically clean dirty pages, which reduces REDO time during restart after a system failure. It is thus desirable for the cleaner service to be continuously active and provide as many clean pages as possible in the buffer pool.
   To flush pages, the cleaner service requires direct access to the buffer-pool data structure and incurs unnecessarily high latching (i.e., low-level concurrency control) contention. If the page cleaner service is too aggressive, it might generate too much interference (mainly with hot pages) and consequently harm transaction activity. Hence, it is highly desirable to reduce any interference with in-memory data structures while enabling the page cleaner to run as aggressively as required. Note that this stands at odds with the goal stated above, which means that a trade-off is required in the traditional design.
   A decoupled page cleaner eliminates this trade-off by enabling cleaning to run as frequently as desired regardless of transaction activity. This is achieved by asynchronously replaying log records on the page versions present on persistent storage, without requiring any access to the page images in the buffer pool. This procedure is similar to those performed by single-pass restore and virtual backups [8].
   However, assuming a group of sequential pages in the cleaner buffer, fetching the log records referring to these pages implies random I/O on the recovery log device, which should be avoided as far as possible for performance reasons. Fortunately, as mentioned in Section 2.4, the log archive can be partially sorted by page identifier and indexed, enabling a sequential read of the log records referring to the pages to be cleaned. Furthermore, to avoid repeated work, the decoupled page cleaner keeps track of the LSN up to which pages have been cleaned. Every activation of the decoupled cleaner then increases this LSN value, essentially bringing the persistent database state up to that LSN. Figure 2 illustrates the main idea of the decoupled page cleaner based on the log archive.

[Figure 2: Decoupled cleaning scenario.]
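   One activation of the decoupled cleaner can be sketched as follows. The archive index, page I/O routines, and redo application are stubs; the essential point is that only persistent page images and the (partially sorted) log archive are touched, never the buffer pool.

#include <algorithm>
#include <cstdint>
#include <vector>

using lsn_t = uint64_t;
using page_id_t = uint32_t;

struct Page { page_id_t id = 0; lsn_t page_lsn = 0; };
struct ArchivedRecord { lsn_t lsn = 0; page_id_t page = 0; /* payload */ };

// Stubs for the assumed primitives: the archive index returns, per
// page, all records in (low, high] in sequential order; page images
// are read from and written to the database device directly.
std::vector<ArchivedRecord> archive_fetch(page_id_t, lsn_t, lsn_t) {
    return {};
}
Page read_persistent_page(page_id_t pid) { return Page{pid, 0}; }
void write_persistent_page(const Page&) {}
void apply_redo(Page& p, const ArchivedRecord& r) { p.page_lsn = r.lsn; }

// One activation of the decoupled cleaner: bring the persistent state
// of the given pages up to 'target_lsn' without touching the buffer
// pool. The returned LSN is remembered to avoid repeated work.
lsn_t clean_decoupled(const std::vector<page_id_t>& pages,
                      lsn_t cleaned_up_to, lsn_t target_lsn) {
    for (page_id_t pid : pages) {
        Page p = read_persistent_page(pid);
        // Replay only what the persistent image is actually missing.
        lsn_t low = std::max(cleaned_up_to, p.page_lsn);
        for (const ArchivedRecord& r : archive_fetch(pid, low, target_lsn))
            apply_redo(p, r);
        write_persistent_page(p);
    }
    return target_lsn;  // persistent database state is now at this LSN
}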
   Two further issues must also be considered to fully decouple a page cleaner from the buffer pool. First, an efficient eviction policy should prefer removing clean pages from the buffer to make room for new pages. Therefore, to keep track of which pages are dirty and which pages are clean, the decoupled page cleaner, after replaying the log records and flushing a more recent version of the pages, still has to acquire latches on pages in the buffer pool to mark them as clean, if necessary. Second, when a page flush is required and the decoupled cleaner is activated, it must be guaranteed that the page is in its most recent version when it is fetched from persistent storage again. This requires some synchronization between the page cleaner and the process of fetching pages from disk.
   To overcome these limitations, the decoupled page cleaner can be combined with a method used for single-page recovery [3]. By employing this kind of recovery, it is possible to determine whether or not a page fetched from persistent storage is up to date. In case the page is outdated, log replay is triggered to bring it to the most recent state before allowing further accesses. This addresses the second problem stated above. A similar idea, called write elision, can be employed to deal with the eviction problem. It relies on single-page recovery to allow evicting a dirty page from the buffer pool without writing it back to disk. In combination, these techniques enable full decoupling of persistence services from buffer-pool data structures.
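   A sketch of such a fetch path is given below, reusing the same hypothetical primitives as the previous sketch. How the expected LSN is obtained (e.g., from the parent node of a B-tree page) is outside the scope of this illustration.

#include <cstdint>
#include <vector>

using lsn_t = uint64_t;
using page_id_t = uint32_t;

struct Page { page_id_t id = 0; lsn_t page_lsn = 0; };
struct ArchivedRecord { lsn_t lsn = 0; /* redo payload */ };

// Stubs as in the previous sketch.
std::vector<ArchivedRecord> archive_fetch(page_id_t, lsn_t, lsn_t) {
    return {};
}
Page read_persistent_page(page_id_t pid) { return Page{pid, 0}; }
void apply_redo(Page& p, const ArchivedRecord& r) { p.page_lsn = r.lsn; }

// Fetch with single-page recovery: 'expected_lsn' is the LSN the page
// should have reached. If the persistent image is older, its missing
// history is replayed before the page is admitted to the buffer pool.
Page fetch_page(page_id_t pid, lsn_t expected_lsn) {
    Page p = read_persistent_page(pid);
    if (p.page_lsn < expected_lsn) {
        // single-page recovery [3]: replay this page's history
        for (const ArchivedRecord& r :
                 archive_fetch(pid, p.page_lsn, expected_lsn))
            apply_redo(p, r);
    }
    return p;
}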




   In a scenario where the working set does not fit into main memory, the advantages of a decoupled cleaner must be reconsidered. If a medium-sized buffer pool is used, such that memory bandwidth is well utilized for high transaction throughput, the decoupled cleaner is not expected to impact performance in any way – positive or negative. However, the advantage of a modular and loosely-coupled architecture suggests it as the better approach. In the case of a small buffer pool, where page I/O clearly becomes the bottleneck, a traditional approach is preferable, since it can provide propagation more efficiently solely by page eviction – i.e., without an asynchronous cleaning daemon. This scenario, however, is quite rare in modern OLTP systems, given the moderate cost of DRAM.

3.2.1    Further Applications of Decoupled Cleaning
   Other than decoupling the modules of page cleaning and buffer management in the system architecture, a decoupled page cleaner enables interesting recovery features which we hope to exploit in future work. These are not considered in our evaluation in Section 4, but are shortly listed here to motivate further applications of our technique.
   Similarly to write elision, a decoupled cleaner together with single-page recovery also enables read elision [2]. The idea consists of simply logging insertions and updates to database pages without the requirement of having them loaded from persistent storage into memory. The decoupled page cleaner is then responsible for later replaying the generated log records to the persistent pages. Both read and write elision offer the advantage of avoiding unnecessary I/O operations that would increase the transaction response time. In other words, the decoupled page cleaner works in synergy with single-page recovery in the sense that it runs page recovery in the background and alleviates the cost of doing it on demand.
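   The following sketch illustrates the idea of read elision under these assumptions (the primitives are hypothetical and the update payload is treated as opaque): if the target page is not buffered, only the log record is produced and the page is neither read nor written on the critical path.

#include <cstdint>
#include <string>

using lsn_t = uint64_t;
using page_id_t = uint32_t;

// Stubs for the assumed primitives of this sketch.
bool page_is_buffered(page_id_t) { return false; }
lsn_t g_next_lsn = 1;
lsn_t log_update_record(page_id_t, const std::string&) {
    return g_next_lsn++;
}
void apply_to_buffered_page(page_id_t, const std::string&) {}

// Read elision: if the target page is not buffered, only a log record
// is generated; the decoupled cleaner later replays it against the
// persistent page image, so no page I/O occurs on the critical path.
lsn_t update_with_read_elision(page_id_t pid, const std::string& change) {
    lsn_t lsn = log_update_record(pid, change);  // WAL in any case
    if (page_is_buffered(pid))
        apply_to_buffered_page(pid, change);     // normal in-place update
    // else: elided -- the page is neither read nor written here
    return lsn;
}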
   A decoupled cleaner also allows different page representations for buffered and persistent images of pages, which may exhibit several advantages. For instance, in a design that maintains only committed log records in the persistent log [10], a no-steal policy is easily achieved. In such a design, uncommitted updates always remain in the volatile buffer pool, so that the need for UNDO logging and recovery is eliminated.
                                                                   since there is no interference with in-memory data structures
4.   EXPERIMENTS
   The main hypothesis to be tested by the following experiments is that a decoupled design may reduce the interference with main-memory data structures, which might deteriorate the transaction throughput of the whole system. In our experiments, we only evaluate the performance of decoupled checkpoints. The decoupled cleaner provides similar results, but a more thorough analysis is currently being carried out and is thus reserved for a future publication.
   We implemented the algorithms previously described for creating decoupled checkpoints in our storage manager (based on Shore-MT [5]). Here, we present experiments comparing the classical and the decoupled checkpoint implementations to exemplify how such a service can interfere with in-memory processing. The experiments consist of executing the TPC-B benchmark, which performs a large number of small read-write transactions, such as debits and credits on a bank account. For each experiment, the database is loaded and the benchmark is executed for 15 minutes.
   The scale factor of TPC-B is set to match the exact number of hardware threads, in such a way that each hardware thread executes transactions referring to a single branch. Consequently, there are no concurrency conflicts that might interfere with the transaction throughput. Furthermore, in order to simulate modern in-memory database systems, the buffer pool size is set to fit the whole database after executing the benchmark. Database and log archive are each stored on their own latency-optimized device (SSD). The experiments were executed with the recovery log both on a dedicated latency-optimized device (SSD) and in a main-memory file system. Having the recovery log in a main-memory file system enables a simulation of database systems with higher transaction throughput, where interference should have a larger impact on performance. However, this approach does not represent a real-world scenario, since a recovery log must always be stored on a non-volatile device.
   The experiments with decoupled checkpoints were executed with log archiving disabled. Transaction throughput was measured by inspecting the log records generated during the benchmark execution.
   Finally, the experiments described here were carried out on an Intel Xeon X5670 server with 96 GB of 1333 MHz DDR3 memory. The system provides dual 6-core CPUs with hyper-threading capabilities, which gives a total of 24 hardware thread contexts. The operating system is 64-bit Ubuntu Linux 12.04 with kernel version 3.11.0.
   Figure 4a shows the results after running the benchmark with the log on an SSD device. Values on the y-axis represent the average transaction throughput in units of thousand transactions per second. Values on the x-axis represent how frequently a checkpoint request is made, in milliseconds. Taking checkpoints too frequently increases the interference with the in-memory data structures, which may harm the transaction throughput of the system. Even though taking a checkpoint every millisecond is not a realistic scenario, it is done in order to stress the system and enforce as much interference as possible.
   The transaction throughput of decoupled checkpoints is not only higher than that of classical checkpoints, but the variation of throughput between different checkpoint frequencies is also smaller. Ideally, the throughput of decoupled checkpoints should be constant for any checkpoint frequency, since there is no interference with in-memory data structures that might disturb the system performance. However, the additional I/O operations required for a decoupled checkpoint to read log information might induce a delay on the writing of log records for transaction commit.
   Taking classical checkpoints at intervals greater than 1000 ms does not produce significant interference in the system, and beyond this point the throughput equals that achieved by decoupled checkpoints.
   The CPU consumption for the whole benchmark execution was also analyzed. The higher the transaction throughput, the higher the CPU consumption. In Figure 3, we compare both designs for the case mentioned above where the transaction throughput is the same (a checkpoint interval of 10 seconds). Here we can observe that decoupled checkpoints consume a small extra amount of CPU compared to classical checkpoints.

[Figure 3: CPU consumption for the setting of Figure 4a with a checkpoint interval of 10 sec. (y-axis: average CPU utilization in %; bars: classical vs. decoupled).]
5.   SUMMARY AND CONCLUSIONS
   In this contribution, we propose an alternative architecture for database systems by decoupling the propagation components from in-memory processing. Decoupling components is desirable not only because it avoids any interference with in-memory data structures, but also because it eliminates much of the code complexity introduced by unnecessary interactions between components. To that end, we implemented novel checkpoint and page cleaner algorithms that are based on log information rather than on data collected from critical in-memory data structures. Based on a qualitative analysis, we verified that the complexity of our codebase was significantly reduced.
   Our techniques are better suited for scenarios where the majority of the application working set fits into main memory. This is because decoupled persistence services are assumed to have a more proactive and preventive role – rather than being critical components on which transactions depend in order to make progress. However, the techniques are not restricted to in-memory workloads in any way.
   The experiments presented in this work show that the decoupled strategy for checkpoints improves the transaction throughput of the system when persistence services are executed very frequently. As the frequency is reduced, the performance of traditional techniques is matched, while the decoupled strategy does not incur any performance penalty. For future work, we plan to evaluate our decoupled page cleaner in more detail, including an analysis of different cleaning policies for both classical and decoupled services. We believe that a combination of the techniques can achieve the most effectiveness in reducing recovery time, which is the main goal of page cleaning.
[Figure 4: TPC-B throughput as a function of checkpoint frequency (ms, x-axis), measured in throughput (ktps, y-axis), for decoupled vs. classical checkpoints. (a) Log in SSD; (b) Log in RAM.]


6.   REFERENCES
   [1] G. Graefe. Sorting and indexing with partitioned B-trees. In Proc. CIDR, pages 5–8, 2003.
   [2] G. Graefe, W. Guy, and C. Sauer. Instant Recovery with Write-Ahead Logging: Page Repair, System Restart, and Media Restore. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2014.
   [3] G. Graefe and H. A. Kuno. Definition, detection, and recovery of single-page failures, a fourth class of database failures. PVLDB, 5(7):646–655, 2012.
   [4] T. Härder and A. Reuter. Principles of transaction-oriented database recovery. ACM Comput. Surv., 15(4):287–317, Dec. 1983.
   [5] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, and B. Falsafi. Shore-MT: A scalable storage manager for the multicore era. In Proc. EDBT, pages 24–35, 2009.
   [6] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz. ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Trans. Database Syst., 17(1):94–162, Mar. 1992.
   [7] C. Sauer, G. Graefe, and T. Härder. An empirical analysis of database recovery costs. In Proc. RDSS Workshop (co-located with SIGMOD), 2014.
   [8] C. Sauer, G. Graefe, and T. Härder. Single-pass restore after a media failure. In Proc. BTW, LNI 241, pages 217–236, 2015.
   [9] C. Sauer, G. Graefe, and T. Härder. Instant restore after a media failure, 2016. Submitted for publication.
   [10] C. Sauer and T. Härder. A novel recovery mechanism enabling fine-granularity locking and fast, redo-only recovery. CoRR, abs/1409.3682, 2014.
   [11] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (it's time for a complete rewrite). In Proc. VLDB, pages 1150–1160, 2007.