<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Decoupling Persistence Services from DBMS Buffer Management</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lucas Lersch</string-name>
          <email>lucas.lersch@sap.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Caetano Sauer</string-name>
          <email>csauer@cs.uni-kl.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Theo Härder</string-name>
          <email>haerder@cs.uni-kl.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Kaiserslautern</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <author-notes>
        <fn id="fn0">
          <p>∗ Lucas Lersch is currently with TU Dresden &amp; SAP SE.</p>
        </fn>
      </author-notes>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>80</fpage>
      <lpage>85</lpage>
      <abstract>
        <p>Advances in DRAM technology and, in turn, substantial cost reductions for volatile memory in recent years call for an evolution of database system architectures to take full advantage of large buffer pools. Even with huge memory sizes, an up-to-date version of database pages on stable storage is more necessary than ever to support fast and effective crash recovery. In this contribution, we consider important components of a traditional DBMS architecture and related opportunities for optimization in the context of persistence and system recovery. We implemented and evaluated novel checkpointing and page-cleaning algorithms which are based on log information rather than on data collected from critical in-memory data structures such as the buffer pool and the transaction manager. Decoupling such persistence-related components from in-memory processing enables a simpler and more modular DBMS architecture as well as less interference of these components with critical in-memory data structures.</p>
      </abstract>
      <kwd-group>
        <kwd>database architecture</kwd>
        <kwd>transaction processing</kwd>
        <kwd>recovery</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The architecture of traditional database systems evolved
in a time when the available amount of main memory was
relatively small compared to the size of typical
application working sets. To achieve satisfactory transaction
performance, these systems had to be heavily optimized for the
underlying hardware, i.e., the design had to consider
frequent operations to persistent storage, taking into account
high latencies of both commit operations and data accesses.
Due to current advances in hardware technology, however,
an adaptation of existing DBMS architectures is required to
unleash their inherent performance potential [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>Compared to the situation in the 1970s and 1980s, one of
the most important hardware-related changes is the dramatic
cost reduction of volatile memory. Nowadays, it is
realistic to consider a scenario where the buffer pool can
accommodate most, if not all, data pages and, consequently, there
are fewer page reads and writes from/to persistent
storage. Furthermore, the commit latency is drastically reduced,
thanks to modern solid state drives which offer a much
higher number of I/O operations per second when compared to
hard-disk drives. In this contribution, we reconsider the
existing components of a traditional database architecture and
opportunities for optimization in the context of persistence
and system recovery.</p>
      <p>
        We assume that even with large amounts of memory and
transaction processing hardly requiring I/O operations, it
is still desirable to keep an up-to-date version of database
pages in persistent storage to guarantee efficient recovery
in case of a system failure. Most modern database systems
rely on write-ahead logging techniques and implement the
ARIES [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] algorithm for system recovery. In such
environments, regular checkpoints are taken to improve recovery
performance. Checkpoints serve as the starting point for log
analysis, the first phase of crash recovery. The more recent
a checkpoint was taken, the less time log analysis takes to
complete in case of a failure.
      </p>
      <p>
        In addition to checkpoints, it is also common for
database systems to periodically clean dirty pages (pages with
committed updates so far not propagated) by flushing them
from the in-memory buffer pool to persistent storage.
Consequently, since the REDO phase (after a crash) would require
random reads, its cost is reduced by keeping the number
of dirty pages low [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In this work, we refer to both
checkpointing and page cleaning in a general way as propagation
services, in the sense that they propagate information from
the main memory to persistent storage.
      </p>
      <p>The problem is that such propagation services interfere
with and might disturb in-memory transaction processing,
since they must inspect data structures and consequently
acquire and release latches on them. This is especially true
for scenarios of large buffer pools capable of holding the
entire application working set, since disk I/O is less likely
to be the bottleneck. Furthermore, decoupling unnecessarily
coupled components in a system is not only a matter of
performance, but it also offers a cleaner, more modular system
architecture and reduces code complexity.</p>
      <p>
        The first contribution of the proposed architecture is to
enable checkpoints to gather all required information solely
by inspecting the log. As a main advantage, it is not
only simpler by making use of the same existing logic of log
analysis during recovery, but it also implies less interference
with in-memory processing, as mentioned above. The second
contribution is a decoupled implementation of a page
cleaner, which also makes use of log data instead of in-memory
information. Besides also reducing the interference in the
buffer pool, it can be combined with single-page recovery
techniques [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that enable a set of interesting features for
the recovery component of a database system, as discussed
later. Figure 1 illustrates the differences between the traditional
design (a: tightly coupled persistence) and the proposed
design (b: decoupled persistence).
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. BACKGROUND</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 System Recovery</title>
      <sec id="sec-3-1">
        <title>2.1.1 ARIES Restart</title>
        <p>
          To offer transaction atomicity and durability guarantees,
i.e., the “A” and “D” of ACID, most modern database
systems implement a write-ahead logging mechanism (WAL).
The ARIES algorithm [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] embodies an efficient way to enable
WAL-based system recovery. In case of a system crash, the
recovery goal is to re-establish the most recent
transaction-consistent database state.
        </p>
        <p>The system recovery algorithm operates in three different
phases: log analysis, REDO, and UNDO. Starting from the
last checkpoint, log analysis scans the log to inspect all log
records to determine which pages were dirty and which
transactions were active at the time of failure. After having
analyzed the log, the REDO phase scans the relevant part of the
log and replays all log records that refer to updates of dirty
pages. Finally, the UNDO phase rolls back transactions that
were active at the time of failure.</p>
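        <p>The interplay of the three phases can be condensed into the following sketch (our own Python illustration with toy dictionary log records, not ARIES pseudocode; in particular, real REDO compares page LSNs rather than blindly re-applying values):</p>
        <preformat><![CDATA[
```python
# Toy sketch (ours, not ARIES pseudocode) of the three restart phases.
def restart(log, persistent_pages):
    # Phase 1 -- log analysis: find dirty pages and loser transactions.
    dirty, losers = set(), set()
    for rec in log:
        if rec["type"] == "update":
            dirty.add(rec["page"])
            losers.add(rec["txn"])
        elif rec["type"] == "commit":
            losers.discard(rec["txn"])
        elif rec["type"] == "page_write":      # page flushed: no longer dirty
            dirty.discard(rec["page"])
    # Phase 2 -- REDO: repeat history for all updates to dirty pages.
    pages = dict(persistent_pages)
    for rec in log:
        if rec["type"] == "update" and rec["page"] in dirty:
            pages[rec["page"]] = rec["new"]
    # Phase 3 -- UNDO: roll back loser transactions, newest record first.
    for rec in reversed(log):
        if rec["type"] == "update" and rec["txn"] in losers:
            pages[rec["page"]] = rec["old"]
    return pages
```
]]></preformat>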
        <p>
          As observed in a previous analysis of typical OLTP
workloads [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], UNDO is usually very short, and the determinant
factor for recovery cost is the number of dirty pages that
require log replay in the REDO phase. Therefore, it is
crucial to clean pages proactively as frequently and
efficiently as possible. Furthermore, it is desirable to take frequent
checkpoints to reduce the time required for log analysis. In a
traditional design with tightly-coupled propagation services,
however, an increased frequency of page cleaning and
checkpoints inevitably hurts performance, as shown in Section 4.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>2.1.2 Instant Restart</title>
        <p>
          An alternative design for system recovery is the instant
restart mechanism [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], which opens the system for new
transactions immediately after the log analysis phase,
performing the required REDO and UNDO actions on demand.
The design is implemented with moderate incremental
changes to the traditional ARIES algorithm. On-demand REDO
of dirty pages is performed page by page instead of using a
log scan. To that end, each log record must include a pointer
(i.e., the LSN) of the last log record affecting a specific
page. Such a pointer is simply copied from the page LSN field
prior to generating a log record.
        </p>
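        <p>A minimal sketch of this per-page chain (our illustration; the LSN is simply assumed to be the record's position in the log):</p>
        <preformat><![CDATA[
```python
# Each log record stores page_prev_lsn, copied from the page's LSN field
# before the update, so a page's history can be walked without a log scan.
def append_update(log, page_lsns, page_id):
    lsn = len(log)                       # toy LSN: position in the log
    log.append({"lsn": lsn, "page": page_id,
                "page_prev_lsn": page_lsns.get(page_id)})
    page_lsns[page_id] = lsn             # page LSN field now points here
    return lsn

def page_history(log, page_lsns, page_id):
    """Walk the per-page chain backwards; returns LSNs newest-first."""
    chain, lsn = [], page_lsns.get(page_id)
    while lsn is not None:
        chain.append(lsn)
        lsn = log[lsn]["page_prev_lsn"]
    return chain
```
]]></preformat>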
        <p>For on-demand UNDO, loser transactions are rolled back
concurrently to REDO and newly running transactions. This
rollback is triggered by lock conflicts between new and
loser transactions. Therefore, log analysis requires a lock
reacquisition for loser transactions, which implies that
currently held locks must also be included in the information
collected for checkpoints – this is in contrast to the ARIES
method which re-acquires such locks in the REDO phase. For a
traditional checkpoint procedure inspecting in-memory data
structures, this effort causes additional interference, which
is eliminated in our decoupled design.</p>
        <p>
          For further details, we refer to an extensive publication on
instant recovery techniques [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2 Checkpoints</title>
      <p>
        In a general way, a checkpoint could be defined as a
relatively recent persistent copy of any information that serves
as a basis for restart and enables a faster recovery process.
The more recent a checkpoint was taken w.r.t. the time of
the system failure, the less work is required for recovering
the system. Thus, it is a good practice to take checkpoints
regularly. To be more precise in the definition of a checkpoint,
we follow the definition of fuzzy checkpoints as presented in
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Therefore, a checkpoint comprises relevant server state
information such as dirty pages, active transactions, and –
for the support of instant restart – all acquired locks. The
checkpoint is not concerned with the database state and,
therefore, it does not write any pages from the buffer pool
– as discussed later, this concern is delegated to the page
cleaner service.
      </p>
      <p>To take a checkpoint, a BEGIN CHKPT log record is
written to the log. Then, all the required information is
gathered from the in-memory data structures and written to
the log as so-called checkpoint log records. An END CHKPT
log record is inserted to indicate that the checkpoint
completed correctly. The information comprised by a checkpoint
summarizes, for the purposes of restoring the server state, all
the log records up to the point where the checkpoint began.</p>
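      <p>The protocol can be sketched as follows (an illustrative record layout of our own, not an actual implementation; note that log analysis must ignore a checkpoint whose END CHKPT record is missing):</p>
      <preformat><![CDATA[
```python
# State is bracketed by BEGIN_CHKPT / END_CHKPT records so that log analysis
# can tell whether the checkpoint completed correctly.
def take_checkpoint(log, dirty_pages, active_txns):
    log.append({"type": "BEGIN_CHKPT"})
    log.append({"type": "CHKPT_DIRTY_PAGES", "pages": sorted(dirty_pages)})
    log.append({"type": "CHKPT_ACTIVE_TXNS", "txns": sorted(active_txns)})
    log.append({"type": "END_CHKPT"})

def last_complete_checkpoint(log):
    """Index of the BEGIN_CHKPT of the most recent completed checkpoint."""
    begin = None
    for i, rec in enumerate(log):
        if rec["type"] == "BEGIN_CHKPT":
            candidate = i
        elif rec["type"] == "END_CHKPT":
            begin = candidate          # only completed checkpoints count
    return begin
```
]]></preformat>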
      <p>After a crash and during system restart, the log analysis
phase retrieves the most recent checkpoint and starts a
forward scan from there up to the end of the log. During the
scan, all log records are analyzed to update the information
collected by the last completed checkpoint. As a result, this
log analysis delivers the relevant restart information for the
state exactly before the crash occurred.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Page Cleaning</title>
      <p>The REDO phase is responsible for replaying all log
records that refer to pages marked as dirty in the buffer pool
at the moment of system failure. Therefore, the amount of
work to be processed by REDO is directly related to two
aspects: (1) the amount of dirty pages in the buffer pool at
failure time and (2) how old the versions of their persistent
page copies are (the older it is, the more log records are
expected to be replayed to bring the page to its most recent
state).</p>
      <p>During normal processing, there are three situations in
which pages are flushed from the buffer pool to persistent
storage. First, if a dirty page is picked for eviction, it is
flushed to persistent storage to make room for fetching the
new page. Considering systems with a large buffer pool, the
process of evicting a dirty page tends to happen less
frequently. Thus, regularly flushing dirty pages at appropriate
points is beneficial for reducing the REDO time in case of
a crash. This corresponds to the second situation. Third, at
normal system shutdown, it is desirable to flush all dirty
pages to avoid any recovery at the next start up.</p>
      <p>Most modern database systems implement a page cleaner
service running as a background thread to handle all these
situations. In other words, the task of flushing a page is
always delegated to the page cleaner service.</p>
      <p>Once it is activated, the page cleaner iterates over the
buffer pool and inspects the frame descriptor blocks to
determine whether the contained pages should be cleaned, i.e.,
flushed to persistent storage. Pages can be selected
according to a number of policies, e.g., hottest first or oldest first.
In this work, we assume a naive policy that simply picks all
dirty pages. Pages picked to be cleaned are then copied to a
separate buffer; this is done to reduce the time for which a
shared latch is held on the page – from a synchronous write
call to a fast memcpy operation. The pages in the cleaning
buffer are then sorted by their page identifier to enable
sequential writes. After the write buffer is flushed, the cleaner
must once again attempt to acquire shared latches on the
pages just flushed and mark them as clean, in case there are
no further modifications on such a page made during the
cleaning process. In total, each cleaned page is latched three
times: to verify its dirty state, to copy it into an in-transit
buffer, and to update the dirty state.</p>
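      <p>A condensed sketch of this cleaning loop (ours, not an actual codebase; for brevity, the first two latch acquisitions are folded into one, and a version counter stands in for the page-LSN check):</p>
      <preformat><![CDATA[
```python
import threading  # Lock objects stand in for page latches

def clean_pages(buffer_pool, write_page):
    """Tightly coupled cleaner sketch: latch to check/copy, flush, re-latch."""
    staged = []
    for frame in buffer_pool:                # latches 1+2: check the dirty bit
        with frame["latch"]:                 # and copy into an in-transit buffer
            if frame["dirty"]:
                staged.append((frame, dict(frame["page"]), frame["version"]))
    staged.sort(key=lambda s: s[1]["id"])    # sort copies for sequential writes
    for frame, copy, version in staged:
        write_page(copy)                     # I/O happens outside any latch
    for frame, copy, version in staged:
        with frame["latch"]:                 # latch 3: mark clean only if the
            if frame["version"] == version:  # page was not modified meanwhile
                frame["dirty"] = False
```
]]></preformat>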
    </sec>
    <sec id="sec-6">
      <title>2.4 Log Archiving</title>
      <p>In WAL-based database systems, the latency of the log
device has a direct impact on transaction throughput.
Therefore, to enable better performance, latency-optimized stable
storage (such as SSDs) for the recovery log is usually
employed. However, compared to storage devices such as
hard-disk drives, latency-optimized devices have a much higher
cost per unit of capacity and are therefore less cost-effective.
Furthermore, it is even more expensive to keep old log records
in a latency-optimized device, assuming that most of these
log records are unlikely to be needed. To avoid filling up the
temporary log device, the log records are continuously
moved into a log archive using a cost- and capacity-optimized,
long-term stable storage device. The latency of such a log
archive device is not critical for the system performance, since
archived log records are usually needed only during recovery
from a media failure.</p>
      <p>
        Instead of simply moving the log records from the
temporary log to the log archive, it is possible to re-organize them
in a more convenient way. To enable single-pass restore of
a storage device after a media failure [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], when moving the
log records to the archive, they are sorted into runs ordered
by page identifier using an external sort-merge operation.
As a result, the log archive is said to be partially sorted,
given that the primary sort criterion is an LSN range – which
identifies a run – and the secondary sort criterion is the
page identifier. These runs are merged asynchronously and
incrementally, using a data structure similar to a partitioned
B-tree [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
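      <p>Run generation and merging can be sketched as follows (a toy model of ours; real runs are bounded by LSN ranges on disk rather than by a fixed record count):</p>
      <preformat><![CDATA[
```python
import heapq

def archive_runs(log_records, run_size):
    """Partition by LSN order into runs; sort each run by (page, lsn)."""
    runs = []
    for start in range(0, len(log_records), run_size):
        run = sorted(log_records[start:start + run_size],
                     key=lambda r: (r["page"], r["lsn"]))
        runs.append(run)
    return runs

def merged_archive(runs):
    """Merge runs into one page-id-ordered stream (single-pass restore order)."""
    return list(heapq.merge(*runs, key=lambda r: (r["page"], r["lsn"])))
```
]]></preformat>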
      <p>
        The partial sort order in the log archive also enables
indexing, meaning that the history of a single page can be
accessed much more efficiently. This was exploited in a proposal
for instant restore [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which enables on-demand restoration
from a media failure concurrently with running
transactions, similar to instant restart. However, such indexing can
also be exploited for efficient REDO operations during
restart after a system failure as well as for single-page recovery
in case of an isolated data corruption [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In our case, we
exploit the partially sorted log archive for a novel
application: propagation of updates with a decoupled page cleaner,
which is discussed in the next section.
      </p>
    </sec>
    <sec id="sec-7">
      <title>3. DECOUPLED PERSISTENCE SERVICES</title>
    </sec>
    <sec id="sec-8">
      <title>3.1 Decoupled Checkpoints</title>
      <p>The idea behind a decoupled checkpoint is to use the
same logic behind log analysis to gather all information
needed for the checkpoint, instead of querying in-memory data
structures for this reason. To take a decoupled checkpoint,
all log records between the last completed checkpoint and
the BEGIN CHKPT log record referring to the checkpoint
currently being taken are analyzed. The main motivation
here is that, as minimal as it may be, any interference with the
data structures caused by the process of checkpoint creation
can be completely avoided. However, since it is desirable to
re-use the same logic of log analysis, some important
differences must be considered.</p>
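      <p>A minimal sketch of this idea (our simplification, not the actual implementation): starting from the state recorded by the previous checkpoint, only the log tail is scanned, exactly as in log analysis, so no in-memory structure is latched:</p>
      <preformat><![CDATA[
```python
def decoupled_checkpoint(prev_state, log_tail):
    """Derive the new checkpoint purely from the previous one plus the log
    records written since it began -- never touching buffer pool or
    transaction manager."""
    dirty = set(prev_state["dirty_pages"])
    active = set(prev_state["active_txns"])
    for rec in log_tail:
        if rec["type"] == "update":
            dirty.add(rec["page"])
            active.add(rec["txn"])
        elif rec["type"] == "commit":
            active.discard(rec["txn"])
        elif rec["type"] == "page_write":   # page writes must be logged
            dirty.discard(rec["page"])
    return {"dirty_pages": dirty, "active_txns": active}
```
]]></preformat>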
      <p>First, taking a traditional checkpoint is completely
independent from information contained in older checkpoints,
since it only inspects in-memory data structures. However,
in the case of a decoupled checkpoint, since the algorithm
is the same as for log analysis, it must rely on
information present in the checkpoint previously completed. This
introduces the limitation that, when taking a new
decoupled checkpoint, the previously completed checkpoint must
always be contained in the log. Alternatively, the contents
of such a checkpoint can be cached in main memory during
normal processing.</p>
      <p>
        Second, since the decoupled checkpoint inspects only log
records, page write operations must be logged to determine
when a previously dirty page can be considered clean. As
observed in previous studies [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], logging page writes is a
common practice used in existing database software, since
it enables more effective recovery from a system failure.
      </p>
      <p>Third, taking a decoupled checkpoint may introduce
additional I/O for reading the temporary log. An approach
that eliminates this overhead maintains checkpoint
information in main memory and continuously updates it by
consuming log records from the log buffer, which also resides
in main memory. When a checkpoint is requested, this
information can then immediately be propagated to the
persistent log. Additionally, the checkpoint generation process
can feed from the same input stream used for log archiving.
When these two techniques are combined, no additional I/O
overhead is incurred. For simplicity of implementation, our
evaluation considers a naive approach where checkpoints
always rely on reading the persistent log.</p>
      <p>Fourth, since the process of taking a decoupled checkpoint
is basically the same as log analysis, the longer the elapsed
time from the last checkpoint, the more time is required
for taking a new decoupled checkpoint. Therefore, the new
technique encourages more frequent checkpoints, which also
benefits recovery performance. Since this kind of checkpoint
creation does not interfere with in-memory processing,
overall system performance should not be affected.
</p>
    </sec>
    <sec id="sec-9">
      <title>3.2 Decoupled Page Cleaner</title>
      <p>To motivate the use of a decoupled page cleaner, we are
first considering a scenario where the whole working set is
in main memory, later expanding our argument to medium
and small buffer pools. For such a large buffer pool, there
are no page misses and therefore no need for page eviction.
Therefore, the only I/O operations to the database device
made by the cleaner service are related to periodically
cleaning dirty pages, which reduces REDO time during restart
after a system failure. Hence, it is desirable for the cleaner
service to be continuously active and provide for as many
clean pages as possible in the buffer pool.</p>
      <p>To flush pages, the cleaner service requires direct access to
the buffer-pool data structure and incurs unnecessarily high
latching (i.e., low-level concurrency control) contention. If
the page cleaner service is too aggressive, it might generate
too much interference (mainly with hot pages) and
consequently harm the transaction activity. Hence, it is highly
desirable to reduce any interference with in-memory data
structures while enabling the page cleaner to run as
aggressively as required. Note that this stands at odds with the
goal stated above, which means that a tradeoff is required
in the traditional design.</p>
      <p>
        A decoupled page cleaner eliminates this trade-off by
enabling more frequent page cleaning regardless of
transaction activity. This is achieved by asynchronously replaying
log records on page versions present on persistent storage,
without requiring any access to the page images in the
buffer pool. This procedure is similar to the ones performed by
single-pass restore and virtual backups [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>However, assuming a group of sequential pages that are
in the cleaner buffer, fetching log records referring to these
pages implies random I/O to the recovery log device, which
should be avoided as far as possible for performance reasons.
Fortunately, as mentioned in Section 2.4, the log archive
device can be partially sorted by page identifier and indexed,
enabling a sequential read of the log records referring to the
pages to be cleaned. Furthermore, to avoid repeated work,
the decoupled page cleaner keeps track of the LSN up to
which pages were cleaned. Every activation of the
decoupled cleaner then increases this LSN value, essentially
bringing the persistent database state up to that LSN. Figure 2
illustrates the main idea of the decoupled page cleaner based
on log archive.</p>
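      <p>The core replay loop can be sketched as follows (our toy model; a persistent page image is reduced to the LSN of the last record applied to it):</p>
      <preformat><![CDATA[
```python
def decoupled_clean(persistent_pages, archive, cleaned_lsn):
    """Replay archived log records (sorted by page id) that are newer than
    cleaned_lsn onto persistent page images, then advance the watermark --
    without ever touching buffer-pool frames."""
    watermark = cleaned_lsn
    for rec in archive:
        if rec["lsn"] > cleaned_lsn:            # skip already-propagated work
            persistent_pages[rec["page"]] = rec["lsn"]
            watermark = max(watermark, rec["lsn"])
    return watermark                            # database is clean up to here
```
]]></preformat>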
      <p>Two further issues must also be considered to fully
decouple a page cleaner from the buffer pool. First, an efficient
eviction policy should give preference for removing clean
pages from the buffer to make room for new pages. Therefore,
to keep track of which pages are dirty and which pages are
clean, the decoupled page cleaner, after replaying the log
records and flushing a more recent version of pages, still has
to acquire latches on pages in the buffer pool to mark them
as clean, if necessary. Second, when a page flush is required
and the decoupled cleaner is activated, it must be
guaranteed that the page is in its most recent version when fetching
it from persistent storage again. This requires some
synchronization among the page cleaner and the process of fetching
pages from disk.</p>
      <p>
        To overcome these limitations, the decoupled page cleaner
can be combined with a method used for single-page
recovery [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. By employing this kind of recovery, it is possible
to determine whether or not a page fetched from persistent
storage is up to date. In case the page is outdated, log replay
is triggered to bring it to the most recent state before
allowing further accesses. This addresses the second problem stated
above. A similar idea, called write elision, can be employed
to deal with the eviction problem. It relies on single-page
recovery to allow evicting a dirty page from the buffer pool
without writing it back to disk. In combination, these
techniques enable full decoupling of persistence services from
buffer-pool data structures.
      </p>
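        <p>The fetch-time check can be sketched as follows (our illustration; <monospace>log_chain_for</monospace> is a hypothetical helper returning a page's log records newer than a given LSN, oldest first):</p>
        <preformat><![CDATA[
```python
def fetch_page(storage, page_id, expected_lsn, log_chain_for):
    """Fetch a page; if its persistent image is older than expected, replay
    its per-page log chain before handing the page out."""
    page = dict(storage[page_id])
    if page["lsn"] < expected_lsn:              # stale image detected
        for rec in log_chain_for(page_id, page["lsn"]):
            page["lsn"] = rec["lsn"]            # apply each newer record
    return page
```
]]></preformat>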
      <p>In a scenario where the working set does not fit into main
memory, the advantages of a decoupled cleaner must be
reconsidered. If a medium-sized buffer pool is used, such
that memory bandwidth is well utilized for high
transaction throughput, the decoupled cleaner is not expected to
impact performance in any way – positive or negative.
However, the advantage of a modular and loosely-coupled
architecture suggests it as a better approach. In the case of a
small buffer pool, where page I/O clearly becomes the
bottleneck, a traditional approach is preferable, since it can more
efficiently provide propagation solely by page eviction – i.e.,
without an asynchronous cleaning daemon. This scenario,
however, is quite rare in modern OLTP systems, given the
moderate cost of DRAM.
</p>
      <sec id="sec-9-1">
        <title>3.2.1 Further Applications of Decoupled Cleaning</title>
        <p>Other than decoupling the modules of page cleaning and
buffer management in the system architecture, a decoupled
page cleaner enables interesting recovery features which we
hope to exploit in future work. These are not considered in
our evaluation in Section 4, but are shortly listed here to
motivate further applications of our technique.</p>
        <p>
          Similarly to write elision, a decoupled cleaner together
with single-page recovery also enables read elision [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The
idea consists of simply logging insertions and updates to
database pages without the requirement of having them loaded
from persistent storage to memory. The decoupled page
cleaner is then responsible for later replaying the generated log
records to the persistent pages. Both read and write elision
offer the advantage of avoiding unnecessary I/O operations
that would increase the transaction response time. In other
words, the decoupled page cleaner works in synergy with
single-page recovery in the sense that it runs page
recovery in the background and alleviates the cost of doing it on
demand.
        </p>
        <p>
          A decoupled cleaner also allows different page
representations for buffered and persistent images of pages, which may
exhibit several advantages. For instance, in a design that
maintains only committed log records in the persistent log
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], a no-steal policy is easily achieved. In such a design,
uncommitted updates always remain in the volatile buffer
pool, so that the need for UNDO logging &amp; recovery is
eliminated.
        </p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>4. EXPERIMENTS</title>
      <p>The main hypothesis tested by the following
experiments is that a decoupled design reduces interference
with main-memory data structures, which might otherwise
deteriorate the transaction throughput of the whole system. In our
experiments, we only evaluate the performance of decoupled
checkpoints. The decoupled cleaner provides similar results,
but a more thorough analysis is currently being carried out
and it is thus reserved for a future publication.</p>
      <p>
        We implemented the algorithms previously described for
creating decoupled checkpoints in our storage manager
(based on Shore-MT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). Here, we present experiments
comparing the classical and the decoupled checkpoint
implementation to exemplify how such a service can interfere with
in-memory processing. The experiments consist of executing
the TPC-B benchmark, which performs a large amount of
small read-write transactions, such as debit and credit on a
bank account. For each experiment, the database is loaded
and the benchmark is executed for 15 minutes.
      </p>
      <p>The scale factor of TPC-B is set to meet the exact number
of hardware threads in such a way that each hardware thread
executes transactions referring to a single branch.
Consequently, there are no concurrency conflicts that might
interfere with the transaction throughput. Furthermore, in order
to simulate modern in-memory database systems, the buffer
pool size is set to fit the whole database size after executing
the benchmark. Database and log archive are each stored
in its own latency-optimized device (SSD). The experiments
were executed with the recovery log both in a dedicated
latency-optimized device (SSD) and in a main-memory file
system. Having the recovery log in a main-memory file
system enables a simulation of database systems with higher
transaction throughput, where interferences should have a
larger impact on performance. However, this approach does
not represent a real-world scenario, since a recovery log must
always be stored on a non-volatile device.</p>
      <p>The experiments with decoupled checkpoints were
executed with log archiving disabled. Transaction throughput was
measured by inspecting the log records generated during the
benchmark execution.</p>
      <p>Finally, the experiments described here were carried out
on an Intel Xeon X5670 server with 96 GB of 1333 MHz
DDR3 memory. The system provides dual 6-core CPUs with
hyper-threading capabilities, which gives a total of 24
hardware thread contexts. The operating system is a 64-bit
Ubuntu Linux 12.04 with Kernel version 3.11.0.</p>
      <p>Figure 4a shows the results after running the benchmark
with the log being on an SSD device. Values on the y-axis
represent the average transaction throughput in units of
thousand transactions per second. Values on the x-axis represent
the interval between checkpoint requests in milliseconds.
Taking checkpoints too frequently increases the interference
with the in-memory data structures, which may harm the
transaction throughput of the system. Even though taking
a checkpoint every millisecond is not a realistic scenario, it
is done in order to stress the system and enforce as much
interference as possible.</p>
      <p>The transaction throughput of decoupled checkpoints is
not only higher when compared to classical checkpoints,
but the variation of throughput between different checkpoint
frequencies is smaller. Ideally, the throughput of decoupled
checkpoints should be constant for any checkpoint frequency,
since there is no interference with in-memory data structures
that might disturb the system performance. However, the
additional I/O operations required for a decoupled
checkpoint to read log information might induce a delay on the
writing of log records for transaction commit.</p>
      <p>Taking classical checkpoints at intervals longer than
1000 ms does not produce significant interference in the
system, and throughput beyond this point equals that achieved
by decoupled checkpoints.</p>
      <p>The CPU consumption for the whole benchmark
execution was also analyzed. The higher the transaction
throughput, the higher the CPU consumption. In Figure 3, we
compare both designs in the case previously mentioned
where the transaction throughput is the same. Here we can
observe that decoupled checkpoints consume a small extra
amount of CPU compared to classical checkpoints.
[Figure 3: Average CPU utilization (%), classical vs. decoupled checkpoints.]</p>
    </sec>
    <sec id="sec-11">
      <title>5. SUMMARY AND CONCLUSIONS</title>
      <p>[Figure 4: Average transaction throughput for checkpoint
frequencies from 1 to 10000 ms, comparing decoupled and
classical checkpoints: (a) Log in SSD, (b) Log in RAM.]</p>
      <p>In this contribution, we propose an alternative
architecture for database systems by decoupling the propagation
components from in-memory processing. Decoupling
components is desirable not only because it avoids any
interference with in-memory data structures, but also because it
eliminates much of the code complexity introduced by
unnecessary interactions between components. To that end, we
implemented novel checkpoint and page cleaner algorithms
that are based on log information rather than on data
collected from critical in-memory data structures. Based on a
qualitative analysis, we verified that the complexity of our
codebase was significantly reduced.</p>
      <p>Our techniques are better suited for scenarios where the
majority of the application working set fits into main
memory. This is because decoupled persistence services are
assumed to have a more proactive and preventive role, rather
than being critical components on which transactions
depend in order to make progress. However, the techniques
are not restricted to in-memory workloads in any way.</p>
      <p>The experiments presented in this work show that the
decoupled strategy for checkpoints improves the
transaction throughput of the system when persistence services are
executed very frequently. As the frequency is reduced, the
decoupled strategy matches the performance of traditional
techniques; thus it incurs no performance
penalty. For future work, we plan to evaluate our decoupled page
cleaner in more detail, including an analysis of different
cleaning policies for both classical and decoupled services. We
believe that a combination of these techniques can be most
effective in reducing recovery time, which is the
main goal of page cleaning.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Graefe</surname>
          </string-name>
          .
          <article-title>Sorting and indexing with partitioned B-trees</article-title>
          .
          <source>In Proc. CIDR</source>
          , pages
          <fpage>5</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Graefe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Sauer</surname>
          </string-name>
          .
          <article-title>Instant recovery with write-ahead logging: page repair, system restart, and media restore</article-title>
          .
          <source>Synthesis Lectures on Data Management</source>
          . Morgan &amp; Claypool Publ.,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Graefe</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Kuno</surname>
          </string-name>
          .
          <article-title>Definition, detection, and recovery of single-page failures, a fourth class of database failures</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>5</volume>
          (
          <issue>7</issue>
          ):
          <fpage>646</fpage>
          -
          <lpage>655</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Härder</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Reuter</surname>
          </string-name>
          .
          <article-title>Principles of transaction-oriented database recovery</article-title>
          .
          <source>ACM Comput. Surv.</source>
          ,
          <volume>15</volume>
          (
          <issue>4</issue>
          ):
          <fpage>287</fpage>
          -
          <lpage>317</lpage>
          , Dec.
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Pandis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hardavellas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ailamaki</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Falsafi</surname>
          </string-name>
          .
          <article-title>Shore-MT: A scalable storage manager for the multicore era</article-title>
          .
          <source>In Proc. EDBT</source>
          , pages
          <fpage>24</fpage>
          -
          <lpage>35</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Mohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Haderle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lindsay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pirahesh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          .
          <article-title>ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging</article-title>
          .
          <source>ACM Trans. Database Syst.</source>
          ,
          <volume>17</volume>
          (
          <issue>1</issue>
          ):
          <fpage>94</fpage>
          -
          <lpage>162</lpage>
          , Mar.
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Graefe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Härder</surname>
          </string-name>
          .
          <article-title>An empirical analysis of database recovery costs</article-title>
          .
          <source>In Proc. RDSS Workshop (co-located with SIGMOD)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Graefe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Härder</surname>
          </string-name>
          .
          <article-title>Single-pass restore after a media failure</article-title>
          .
          <source>In Proc. BTW</source>
          , LNI
          <volume>241</volume>
          , pages
          <fpage>217</fpage>
          -
          <lpage>236</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Graefe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Härder</surname>
          </string-name>
          .
          <article-title>Instant restore after a media failure</article-title>
          . Submitted for publication,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sauer</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Härder</surname>
          </string-name>
          .
          <article-title>A novel recovery mechanism enabling fine-granularity locking and fast, redo-only recovery</article-title>
          .
          <source>CoRR, abs/1409.3682</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Harizopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hachem</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Helland</surname>
          </string-name>
          .
          <article-title>The end of an architectural era: (it's time for a complete rewrite)</article-title>
          .
          <source>In Proc. VLDB</source>
          , pages
          <fpage>1150</fpage>
          -
          <lpage>1160</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>