=Paper=
{{Paper
|id=Vol-2971/paper10
|storemode=property
|title=Main Memory Databases Instant Recovery
|pdfUrl=https://ceur-ws.org/Vol-2971/paper10.pdf
|volume=Vol-2971
|authors=Arlino Magalhães
|dblpUrl=https://dblp.org/rec/conf/vldb/Magalhaes21
}}
==Main Memory Databases Instant Recovery==
Main Memory Databases Instant Recovery

Arlino Henrique Magalhaes de Araujo
Supervised by Jose Maria Monteiro and Angelo Brayner
Federal University of Ceara
Federal University of Piaui
Fortaleza, Brazil
arlino@ufpi.edu.br

ABSTRACT

The recovery process in main memory database systems (MMDBs) runs in an offline way. Thus, an MMDB only becomes available for new transactions after the complete recovery process has finished. Some MMDBs maintain database replicas to assure high availability after system failures. Nonetheless, a database replication mechanism is not immune to failures either. For that reason, recovery techniques are required to repair failed systems as quickly as possible. This work proposes an instant recovery strategy for MMDBs, which makes MMDBs able to process transactions immediately after the recovery engine is triggered. The proposed approach rebuilds the database incrementally and on demand. Besides, a novel checkpoint technique is proposed to interfere as little as possible with system performance. The checkpoint technique can also act during the recovery process so that the next recoveries are faster in the face of successive failures. In order to validate the approach, simulations with a prototype implemented on Redis have been conducted over the Memtier benchmark. Preliminary results evidence the suitability of the proposed recovery mechanism.

Proceedings of the VLDB 2021 PhD Workshop, August 16th, 2021. Copenhagen, Denmark. Copyright (C) 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

It is a matter of fact that main memory databases provide very high throughput rates since the primary database is handled in main memory. Nevertheless, databases residing in a volatile memory are much more sensitive to system failures than traditional disk-resident ones. The recovery mechanism is responsible for restoring the database to the most recent consistent state before a system failure [2, 7, 8].

The recovery process for most MMDBs is performed offline, meaning that the database and its applications only become available for new transactions after the full recovery process is completed. One might claim that systems are capable of keeping database replicas for high availability. In fact, with the advent of high-availability infrastructure, recovery speed has become secondary in importance to runtime performance for most MMDBs. Nevertheless, high-availability infrastructure is not immune to human errors and unpredictable defects in software and firmware, which are a source of failures and might cause multiple and shared problems. Besides, it requires an additional cost to the system infrastructure [2, 7, 12, 15].

This paper proposes an instant recovery approach for MMDBs. The proposed approach allows MMDBs to schedule new transactions immediately after the failure, during the recovery process, giving the impression that the system was instantly restored. The main idea of instant recovery is to organize the log file in a way that enables efficient on-demand and incremental recovery of individual database tuples. Furthermore, the paper presents a log record propagation scheme (checkpoint) to accelerate recovery and free up log space. The scheme uses an extension of the fuzzy checkpoint approach to try not to degrade system performance. The checkpoint can also propagate records during recovery and thus accelerate the next recoveries in case of successive failures.

The remainder of this paper is organized as follows. Section 2 provides an overview of MMDB recovery and related work. Section 3 presents the proposed approach for database instant recovery. Section 4 discusses the results of empirical experiments. Finally, Section 5 concludes this paper.

2 BACKGROUND AND RELATED WORK

Most MMDBs implement a logical logging technique, which records higher-level database operations, such as inserting database tuples. MMDBs produce only Redo log records of modified tuples to reduce the amount of data written to secondary storage. The commit processing uses group commit, i.e., it tries to group multiple log records into one large I/O [15, 18].

The MMDB checkpoint materializes logical log operations to physical data in a checkpoint file. Most MMDBs produce a consistent checkpoint file equivalent to a materialized database state at an instant of time, commonly called a snapshot [1–3].

Whenever a system crash occurs in an MMDB, the primary copy of the database is lost. Thus, the recovery manager should load the last checkpoint into memory and redo log records [2, 14, 15]. Hekaton [1], VoltDB [14], HyPer [4], SAP HANA [3], and SiloR [18] are examples of modern MMDBs that perform the recovery activities mentioned above. Nevertheless, those systems do not execute new transactions until the full recovery is completed.

PACMAN [15] and Adaptive Logging [16] utilize a dependency graph between the performed transactions to identify opportunities for database recovery in parallel. Those systems must also wait for the full database recovery to service new transactions.

The Log-Structured Merge tree (LSM-tree) [10] provides low-cost indexing for a file that has a high rate of record insertions and deletions. The LSM-tree access method uses a buffer to avoid multiple I/Os in secondary memory for frequently referenced pages. This approach is not suitable for writing log records since they require immediate and atomic persistence during commit processing.
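The group-commit idea mentioned above (grouping multiple transactions' log records into one large I/O) can be sketched in a few lines. The snippet below is a toy illustration under our own naming, not the code of any system cited here: committing transactions enqueue their records and block until a single writer thread has flushed the whole batch with one write and one fsync.

```python
import os
import queue
import threading

class GroupCommitLogger:
    """Toy group-commit logger (illustrative; names are hypothetical).

    Committing transactions enqueue their log records and block until a
    single writer thread has flushed a whole batch with one write plus
    one fsync, amortizing the I/O cost across transactions."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.q = queue.Queue()
        threading.Thread(target=self._writer, daemon=True).start()

    def commit(self, records):
        done = threading.Event()
        self.q.put((b"".join(records), done))
        done.wait()                     # transaction blocks until its batch is durable

    def _writer(self):
        while True:
            batch = [self.q.get()]      # wait for the first committer...
            while not self.q.empty():   # ...then drain whatever else arrived
                batch.append(self.q.get())
            os.write(self.fd, b"".join(buf for buf, _ in batch))
            os.fsync(self.fd)           # one fsync covers every transaction in the batch
            for _, done in batch:
                done.set()
```

With many concurrent committers, the writer typically issues far fewer fsyncs than there are transactions, which is the whole point of the technique.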
Sauer et al. [12, 13] present a technique for instantly restoring a disk-resident database. This technique uses a partitioned index to write log records efficiently. After a crash, the recovery loads pages from a backup device and their corresponding log records from the indexed log. New transactions are allowed as soon as their necessary pages are restored. As a disadvantage, when the number of partitions increases, the system may have to search multiple partitions to retrieve a page, which might delay the recovery.

In [6], the authors provide a survey of techniques for implementing recovery in MMDBs. Besides, they describe the main features of the recovery mechanisms delivered by well-known MMDBs.

3 THE INSTANT RECOVERY MECHANISM

The existing MMDB recovery strategies use a sequential log, which makes instant recovery impossible. Recovery by a sequential log is not incremental and requires full recovery before any tuple can be accessed. This scenario does not allow the system to execute an on-demand transaction during recovery since the tuples required by the transaction can only be accessed after the recovery process has finished [2, 7, 12].

The instant recovery technique presented in this work builds the log file as an index structure. This log organization enables an efficient restoration of a tuple. A single fetch on the indexed log can restore one tuple. Thus, the system can use the indexed log to recover a database by restoring tuple by tuple incrementally. This technique naturally supports database availability since a new transaction can access a tuple immediately after the tuple is restored, i.e., transactions do not have to wait for a full recovery to access restored tuples. Figure 1 shows the architecture that implements the proposed MMDB instant recovery approach. The next subsections discuss the main components of the architecture and their interactions.

Figure 1: Architecture of the instant recovery mechanism.
3.1 The Logging Strategy

The proposed approach for MMDB instant recovery employs two log files: a sequential log (Figure 2(a)) and an indexed log (Figure 2(b)). Each record in the sequential log represents an update performed on a tuple by a transaction. During transaction processing, each transaction generates Redo records that are kept in a thread-local buffer. During the commitment, all log records generated by a transaction are appended atomically to the sequential log by the Logger component. The Log Sequence Number (LSN) represents the order in which a record was stored. This scheme ensures log consistency to recover the database.

The proposed recovery scheme requires efficient log reading to fetch the records needed to redo a given tuple during recovery. For this reason, the logging strategy implements an indexed log. The index structure is an extension of the B+-tree in which each node contains a tuple ID and the log records generated by transaction updates on that tuple. Therefore, the log granularity is the tuple. One probe on the B+-tree retrieves all the records needed to restore a single tuple. The Indexer component indexes records from the sequential log into the indexed log asynchronously to the transaction commit, i.e., a transaction does not need to wait for the log indexing to commit its execution. Records can be removed from the sequential log after they are indexed in the B+-tree. However, the sequential log is maintained to ensure consistent database recovery in the event of index corruption. In this case, the system must build a new indexed log from the sequential log. This process delays the start of the recovery process.
The primary purpose of instant recovery is to restore the database efficiently without degrading the transaction throughput provided by the system. The indexed log requires random writes, while a sequential log has a sequential write pattern. Writing records into a sequential log file is potentially faster than doing so into an indexed log file. For this reason, in our approach, log records are written to the sequential file first, since it provides efficient record writes. Periodically, the log records are flushed from the sequential log to the indexed log file. The indexing process occurs asynchronously to the transaction commit operation so as not to degrade the processing. Therefore, the proposed log organization can flush log records efficiently to a sequential log during transaction processing, and it needs only one fetch on the indexed log to retrieve all the log records required to restore one tuple.

Figure 2 illustrates the logging process. Observe that transactions Tx1, Tx2, and Tx3 generated log records for updates performed on tuples Tp1, Tp2, and Tp3. The sequential log stores the records flushed by the three transactions. The log records with LSN 11, 13, 16, and 19 represent the updates performed on tuple Tp1, for example. A fetch on the indexed log can retrieve the records with LSN 11, 13, 16, and 19 to redo the tuple Tp1. The absence of an index would imply a full scan of the sequential log to restore Tp1.

Figure 2: Sequential log (a), and indexed log (b).
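The interplay between the sequential log, the Indexer, and the indexed log can be illustrated with a minimal sketch. The names below are hypothetical, a plain dict of lists stands in for the B+-tree, and indexing is triggered by an explicit call rather than a background thread; the commit path never waits for it, mirroring the asynchronous design.

```python
class InstantRecoveryLog:
    """Sketch of the two-log organization (hypothetical names).

    Commits append to a sequential log (fast, append-only writes); an
    Indexer step later groups those records by tuple ID into an indexed
    log, here a dict of lists standing in for the B+-tree."""

    def __init__(self):
        self.sequential = []     # list of (lsn, tuple_id, value): append-only
        self.indexed = {}        # tuple_id -> [records]: stand-in for the B+-tree
        self.next_lsn = 1
        self.indexed_upto = 0    # how much of the sequential log is indexed

    def commit(self, updates):
        """Append a transaction's Redo records; no indexing on this path."""
        for tuple_id, value in updates:
            self.sequential.append((self.next_lsn, tuple_id, value))
            self.next_lsn += 1

    def index_pending(self):
        """Indexer step, asynchronous to commit in the real design."""
        for rec in self.sequential[self.indexed_upto:]:
            self.indexed.setdefault(rec[1], []).append(rec)
        self.indexed_upto = len(self.sequential)

    def records_for(self, tuple_id):
        """One probe retrieves every record needed to redo one tuple."""
        return self.indexed.get(tuple_id, [])
```

Echoing the Figure 2 scenario: after two commits touching Tp1, a single `records_for("Tp1")` lookup returns all of Tp1's records in LSN order, with no scan of the sequential log.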
3.2 The Checkpoint Strategy

During transaction processing, the Accesses Logger component stores tuple access information. The Checkpointer component uses this information to identify the most frequently accessed tuples. These tuples are associated with most of the generated log records. Periodically, the Checkpointer starts the checkpoint process, which is an extension of the fuzzy checkpoint approach. The checkpoint propagates the record actions in the indexed log of the most frequently used tuples.

For each tuple T, in which T is one of the most frequently accessed tuples, the Checkpointer retrieves the set S of log records contained in the node N of the B+-tree (in the indexed log) associated with T. The set S is able to redo the tuple T entirely in case of failure. Then, a new log record L, whose effect is equivalent to that of the set S to redo tuple T, is generated. Finally, the log record L replaces the set S on node N. This process is useful to decrease the number of log records to be processed during recovery. As a result, the checkpoint can accelerate the recovery process and liberate log space. The checkpoint is performed asynchronously to the transaction processing.
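The consolidation step, replacing the set S of records for a hot tuple with one equivalent record L, can be sketched as follows. The sketch assumes whole-tuple put-style Redo records (so replaying S reduces to keeping the value with the highest LSN); the function names and record layout are ours for illustration, not the paper's.

```python
def replay(records):
    """Redo a tuple by applying its log records in LSN order.
    Assumes whole-tuple 'put' records, so the last value wins."""
    value = None
    for _, _, v in sorted(records):      # records are (lsn, tuple_id, value)
        value = v
    return value

def checkpoint_tuple(indexed, tuple_id, next_lsn):
    """Replace the set S of records for one tuple with a single
    equivalent record L, shrinking the work of a future recovery."""
    s = indexed.get(tuple_id, [])
    if len(s) <= 1:
        return next_lsn                  # nothing to consolidate
    l = (next_lsn, tuple_id, replay(s))  # record L: same redo effect as S
    indexed[tuple_id] = [l]
    return next_lsn + 1
```

After the call, a recovery that probes this tuple processes one record instead of |S|, which is exactly how the checkpoint accelerates successive recoveries and frees log space.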
3.3 The Recovery Process

After a system failure, the system should initiate the database recovery by restoring tuples through the indexed log. However, the record indexing process is asynchronous to the transaction commit. As a result, some records in the sequential log may not have been indexed before a failure. Therefore, immediately before starting recovery, the system must verify whether any records have not yet been indexed. The Indexer component must index those records to ensure recovery consistency. When this process ends, recovery can begin and new transactions can be performed. The Restorer component then begins redoing tuples by traversing the indexed log B+-tree. Each visit to a B+-tree node can retrieve the update records to redo a tuple. When a tuple is redone, its key is marked as restored. After visiting all B+-tree nodes, all database tuples are restored, and the recovery process is completed.

The indexed log recovery scheme can naturally support availability since new transactions can be executed immediately after restoring their required tuples. Furthermore, this recovery scheme can service new transactions whose necessary tuples have not yet been loaded into memory during recovery. When a transaction requires tuples, the Scheduler component checks whether the tuples have already been restored. If they are not in memory, the Scheduler must request these tuples from the Restorer on demand. Then, the recovery manager should pause the incremental recovery (the traversal of the B+-tree) and begin fetching the necessary tuples for the transaction from the indexed log. After the transaction's tuples are restored, they are marked as restored, the transaction can run, and the system can continue the incremental recovery.
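A minimal sketch of the Restorer/Scheduler interplay described above, under hypothetical names and with a dict-based indexed log standing in for the B+-tree: a transaction arriving mid-recovery has just the tuples it needs restored first and then runs, while the bulk traversal can resume afterwards.

```python
class Restorer:
    """Sketch of incremental, on-demand redo (hypothetical names).

    recover_all() walks the indexed log tuple by tuple; restore() redoes
    one tuple from a single probe, so it can also serve on-demand requests
    that arrive while recovery is still in progress."""

    def __init__(self, indexed_log):
        self.indexed = indexed_log        # tuple_id -> [(lsn, tuple_id, value)]
        self.db = {}                      # in-memory database being rebuilt
        self.restored = set()

    def restore(self, tuple_id):
        if tuple_id not in self.restored and tuple_id in self.indexed:
            records = self.indexed[tuple_id]      # one probe per tuple
            self.db[tuple_id] = max(records)[2]   # highest LSN wins (put records)
            self.restored.add(tuple_id)

    def recover_all(self):
        for tuple_id in list(self.indexed):       # incremental traversal
            self.restore(tuple_id)

class Scheduler:
    """Checks/requests restoration of a transaction's tuples on demand."""

    def __init__(self, restorer):
        self.restorer = restorer

    def run_transaction(self, needed, txn):
        for tuple_id in needed:                   # pause bulk recovery and
            self.restorer.restore(tuple_id)       # fetch these tuples first
        return txn(self.restorer.db)
```

In this toy form, a transaction that needs only Tp2 can run as soon as Tp2 is redone, even though the rest of the database has not been recovered yet.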
During recovery, the checkpoint propagates log record actions similarly to the algorithm shown in Section 3.2. However, the propagation is applied to the log records of each restored tuple, rather than just to the most frequently used tuples. Therefore, after a tuple is restored, the Restorer requests the Checkpointer to perform a checkpoint for the log records of that tuple. The contents of the restored tuple are used to generate a new log record that replaces the log records in the B+-tree node used to redo the tuple. If successive failures occur, the next recovery will process fewer log records.

3.4 The Evaluation Prototype

The instant recovery approach proposed in this paper was implemented in Redis 5.0.7 [11] to evaluate the feasibility of indexing for log replay. The evaluation prototype can be downloaded at https://drive.google.com/drive/folders/1LTbtY36O0kWIpxZBM-hc1BPvIjICuy2F. Redis is an open-source in-memory data structure store used as an in-memory key-value database. Redis is written in ANSI C, has a simple architecture, and its source code is easy to understand. These characteristics facilitated the development of the prototype.

Redis uses an append-only file (AOF) to write log records and generates snapshots at regular intervals as a binary dump using the Redis RDB Dump File Format. Redis can automatically rewrite the AOF in the background in case its size exceeds the optimal size [11]. The prototype uses the AOF from Redis, i.e., we did not need to implement a sequential log. The RDB was disabled in the prototype. Moreover, the system does not rewrite the log.

The B+-tree of the indexed log was implemented using Berkeley DB 4.8 [9]. Berkeley Database (Berkeley DB or BDB) is a software library intended to provide a high-performance embedded database for key/value data. BDB allows the specification of the underlying organization of the data in various database implementations (e.g., b-tree, hash, queue, and recno). The checkpoint component proposed in this work is still under development. Thus, the experiments presented in the next section do not include this module.
4 RECOVERY MECHANISM EVALUATION

We have empirically evaluated the proposed instant recovery mechanism in order to demonstrate its efficiency and its suitability to be implemented in MMDBs. All experiments were executed with 4 worker threads on an Intel Core i7-9700K CPU at 3.60GHz x 8. The system has 64GB of RAM and a 400GB Kingston SA400S37 SSD as the persistent storage device. The operating system was Ubuntu Linux 18.04.2 LTS. We used the Memtier benchmark [5, 17] to perform the tests. Memtier is a high-throughput benchmarking tool for Redis developed by Redis Labs. The tool offers options to generate various workload patterns. Memtier has already been used in several scientific works, such as [17], for example.

4.1 Preliminary Experiments

The experiments focused on measuring the time to fully recover a database, the availability to process transactions after a system failure, the time needed to run a workload entirely, and the logging overhead. These experiments were performed on a database containing 99,507 keys that generated an 11.8GB sequential log file containing 160 million records. Additionally, an indexed log was generated along with this sequential log using the recovery technique proposed in this work. For each experiment, the system was shut down to simulate a failure. At the database restart, as soon as the recovery process was triggered, a workload would be submitted. Thus, one could measure transaction throughput and recovery time from system restart.

The key goal was to compare the proposed instant recovery approach to the traditional MMDB recovery. However, we also tested our instant recovery scheme in different scenarios to confirm the following expectations about our technique: (1) an indexed log must be employed to recover the database incrementally and on demand, and (2) the asynchronous indexing of log records must be adopted to avoid transaction processing overhead. Thus, the experiments have been conducted in the three following scenarios: (i) Sequential Log Recovery (SLR); (ii) Asynchronous Indexed Log Instant Recovery (AILIR); and (iii) Synchronous Indexed Log Instant Recovery (SILIR).
The SLR scenario (traditional recovery) uses only a sequential log. In this scenario, transaction update records are written to a sequential log file during transaction processing. The recovery process recovers the database by scanning the entire log file. The AILIR scenario (our approach) uses a sequential log plus an indexed log. In AILIR, transaction update records are written to a sequential log during transaction processing and stored in an indexed log asynchronously to the transaction commit. The SILIR scenario (which is derived from AILIR) uses only an indexed log. In SILIR, transaction update records are written directly to an indexed log synchronously to the transaction commit. After a failure, for both scenarios ii (AILIR) and iii (SILIR), the recovery manager must traverse the B+-tree to recover the database. The SILIR scenario was created to measure the log indexing overhead during transaction processing and instant recovery processing.

For each scenario, the experiments were performed on workloads with a 5:5 ratio between read and write operations. They were simulated using the Memtier benchmark, which operated 4 worker threads, with each thread driving 50 clients. Each client made 170,000 requests in a random pattern.
The results of the recovery experiments are shown in Figure 3. The vertical dashed lines in the figure indicate the final recovery time of the approach with the respective color. AILIR did not overload the transaction throughput since its throughput was similar to that of the default approach (SLR). This result was already expected because AILIR and SLR flush log records to secondary memory in a similar manner, except that AILIR additionally indexes the log records. However, the indexing did not interfere with the transaction throughput because it is performed asynchronously to the transaction commit. SILIR had the worst performance due to its synchronous log indexing, i.e., a transaction must wait for the indexing to confirm its writes. Although SLR recovered the database before AILIR, AILIR was able to perform transactions from the system restart onward and was the fastest approach to finish the workload execution. This result was achieved because AILIR has asynchronous indexing and can process transactions while the system is recovering. Additionally, the client application did not notice the AILIR recovery, giving the impression that the recovery was instantaneous.

Figure 3: Experiment results.

5 CONCLUSION

This paper proposed an instant recovery approach for MMDBs. The approach allows new transactions to run concurrently with the recovery process. The approach takes benefit of using a log file with a B+-tree structure. Thus, the recovery mechanism is able to seek tuples directly in the log file to rebuild the database in an on-demand and incremental fashion. New transactions are scheduled as soon as the required tuples are restored into the MMDB.

The results show that instant recovery reduces the perceived time to repair the database, seeing that transactions can be performed as soon as the system is restarted. In other words, it can effectively deliver the tuples that new transactions need during the recovery process. The experiments also analyzed the impact of using an indexed log structure on transaction throughput rates in an OLTP workload benchmark. We believe that adding a checkpoint module to the prototype developed in this work will increase system availability and provide faster database recovery.
REFERENCES

[1] Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Ake Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. 2013. Hekaton: SQL Server's memory-optimized OLTP engine. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 1243–1254.
[2] Franz Faerber, Alfons Kemper, Per-Åke Larson, Justin Levandoski, Thomas Neumann, Andrew Pavlo, et al. 2017. Main Memory Database Systems. Foundations and Trends in Databases 8, 1-2 (2017), 1–130.
[3] Franz Färber, Sang Kyun Cha, Jürgen Primsch, Christof Bornhövd, Stefan Sigg, and Wolfgang Lehner. 2012. SAP HANA database: data management for modern business applications. ACM SIGMOD Record 40, 4 (2012), 45–51.
[4] Florian Funke, Alfons Kemper, Tobias Mühlbauer, Thomas Neumann, and Viktor Leis. 2014. HyPer Beyond Software: Exploiting Modern Hardware for Main-Memory Database Systems. Datenbank-Spektrum 14, 3 (2014), 173–181.
[5] Redis Labs. 2020. Redis Labs | The Best Redis Experience. Retrieved October 06, 2020 from https://redislabs.com
[6] Arlino Magalhaes, Jose Maria Monteiro, and Angelo Brayner. 2021. Main Memory Database Recovery: A Survey. ACM Computing Surveys (CSUR) 54, 2 (2021), 1–36.
[7] Nirmesh Malviya, Ariel Weisberg, Samuel Madden, and Michael Stonebraker. 2014. Rethinking main memory OLTP recovery. In Data Engineering (ICDE), 2014 IEEE 30th International Conference on. IEEE, 604–615.
[8] C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. 1992. ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems (TODS) 17, 1 (1992), 94–162.
[9] Michael A. Olson, Keith Bostic, and Margo I. Seltzer. 1999. Berkeley DB. In USENIX Annual Technical Conference, FREENIX Track. 183–191.
[10] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. 1996. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 4 (1996), 351–385.
[11] Redis. 2020. Redis. Retrieved August 26, 2020 from https://redis.io
[12] Caetano Sauer, Goetz Graefe, and Theo Härder. 2017. Instant restore after a media failure. In Advances in Databases and Information Systems. Springer, 311–325.
[13] Caetano Sauer, Goetz Graefe, and Theo Härder. 2018. FineLine: log-structured transactional storage and recovery. Proceedings of the VLDB Endowment 11, 13 (2018), 2249–2262.
[14] Michael Stonebraker and Ariel Weisberg. 2013. The VoltDB Main Memory DBMS. IEEE Data Eng. Bull. 36, 2 (2013), 21–27.
[15] Yingjun Wu, Wentian Guo, Chee-Yong Chan, and Kian-Lee Tan. 2017. Fast Failure Recovery for Main-Memory DBMSs on Multicores. In Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 267–281.
[16] Chang Yao, Divyakant Agrawal, Gang Chen, Beng Chin Ooi, and Sai Wu. 2016. Adaptive logging: Optimizing logging and recovery costs in distributed in-memory databases. In Proceedings of the 2016 International Conference on Management of Data. ACM, 1119–1134.
[17] Yiying Zhang and Steven Swanson. 2015. A study of application performance with non-volatile main memory. In 2015 31st Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 1–10.
[18] Wenting Zheng, Stephen Tu, Eddie Kohler, and Barbara Liskov. 2014. Fast Databases with Fast Durability and Recovery Through Multicore Parallelism. In OSDI, Vol. 14. 465–477.