=Paper=
{{Paper
|id=Vol-2971/paper10
|storemode=property
|title=Main Memory Databases Instant Recovery
|pdfUrl=https://ceur-ws.org/Vol-2971/paper10.pdf
|volume=Vol-2971
|authors=Arlino Magalhães
|dblpUrl=https://dblp.org/rec/conf/vldb/Magalhaes21
}}
==Main Memory Databases Instant Recovery==
Main Memory Databases Instant Recovery

Arlino Henrique Magalhaes de Araujo
Supervised by Jose Maria Monteiro and Angelo Brayner
Federal University of Ceara
Federal University of Piaui
Fortaleza, Brazil
arlino@ufpi.edu.br

ABSTRACT

The recovery process in main memory database systems (MMDBs) runs in an offline way. Thus, an MMDB only becomes available for new transactions after the complete recovery process has finished. Some MMDBs maintain database replicas to assure high availability after system failures. Nonetheless, a database replication mechanism is not immune to failures either. For that reason, recovery techniques are required to repair failed systems as quickly as possible. This work proposes an instant recovery strategy for MMDBs, which makes MMDBs able to process transactions immediately after the recovery engine is triggered. The proposed approach rebuilds the database incrementally and on demand. Besides, a novel checkpoint technique is proposed to interfere as little as possible with system performance. The checkpoint technique can also act during the recovery process so that the next recoveries are faster in the face of successive failures. In order to validate the approach, simulations with a prototype implemented on Redis have been conducted over the Memtier benchmark. Preliminary results evidence the suitability of the proposed recovery mechanism.

Proceedings of the VLDB 2021 PhD Workshop, August 16th, 2021. Copenhagen, Denmark. Copyright (C) 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

It is a matter of fact that main memory databases provide very high throughput rates since the primary database is handled in main memory. Nevertheless, databases residing in a volatile memory are much more sensitive to system failures than traditional disk-resident ones. The recovery mechanism is responsible for restoring the database to the most recent consistent state before a system failure [2, 7, 8].

The recovery process for most MMDBs is performed offline, meaning that the database and its applications only become available for new transactions after the full recovery process is completed. One might claim that systems are capable of keeping database replicas for high availability. In fact, with the advent of high-availability infrastructure, recovery speed has become secondary in importance to runtime performance for most MMDBs. Nevertheless, high-availability infrastructure is not immune to human errors and unpredictable defects in software and firmware, which are a source of failures and might cause multiple and shared problems. Besides, it requires an additional cost to the system infrastructure [2, 7, 12, 15].

This paper proposes an instant recovery approach for MMDBs. The proposed approach allows MMDBs to schedule new transactions immediately after the failure, during the recovery process, giving the impression that the system was instantly restored. The main idea of instant recovery is to organize the log file in a way that enables efficient on-demand and incremental recovery of individual database tuples. Furthermore, the paper presents a log record propagation scheme (checkpoint) to accelerate recovery and free up log space. The scheme uses an extension of the fuzzy checkpoint approach to try not to degrade system performance. The checkpoint can also propagate records during recovery and thus accelerate the next recoveries in case of successive failures.

The remainder of this paper is organized as follows. Section 2 provides an overview of MMDB recovery and related work. Section 3 presents the proposed approach for database instant recovery. Section 4 discusses the results of empirical experiments. Finally, Section 5 concludes this paper.

2 BACKGROUND AND RELATED WORK

Most MMDBs implement a logical logging technique, which records higher-level database operations, such as inserting database tuples. MMDBs produce only Redo log records of modified tuples to reduce the amount of data written to secondary storage. The commit processing uses group commit, i.e., it tries to group multiple log records into one large I/O [15, 18].

The MMDB checkpoint materializes logical log operations to physical data in a checkpoint file. Most MMDBs produce a consistent checkpoint file equivalent to a materialized database state at an instant of time, commonly called a snapshot [1–3].

Whenever a system crash occurs in an MMDB, the primary copy of the database is lost. Thus, the recovery manager should load the last checkpoint into memory and redo log records [2, 14, 15]. Hekaton [1], VoltDB [14], HyPer [4], SAP HANA [3], and SiloR [18] are examples of modern MMDBs that perform the recovery activities mentioned above. Nevertheless, those systems do not execute new transactions until the full recovery is completed.

PACMAN [15] and Adaptive Logging [16] utilize a dependency graph between the performed transactions to identify opportunities for database recovery in parallel. Those systems must also wait for the full database recovery to service new transactions.

The Log-Structured Merge tree (LSM-tree) [10] provides low-cost indexing for a file that has a high rate of record insertions and deletions. The LSM-tree access method uses a buffer to avoid multiple I/Os in secondary memory for frequently referenced pages. This approach is not suitable for writing log records since they require immediate and atomic persistence during commit processing.
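The group-commit idea mentioned above (grouping multiple transactions' log records into one large I/O) can be sketched in a few lines. The snippet below is a toy illustration under our own naming, not the code of any system cited here: committing transactions enqueue their records and block until a single writer thread has flushed the whole batch with one write and one fsync.

```python
import os
import queue
import threading

class GroupCommitLogger:
    """Toy group-commit logger (illustrative; names are hypothetical).

    Committing transactions enqueue their log records and block until a
    single writer thread has flushed a whole batch with one write plus
    one fsync, amortizing the I/O cost across transactions."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.q = queue.Queue()
        threading.Thread(target=self._writer, daemon=True).start()

    def commit(self, records):
        done = threading.Event()
        self.q.put((b"".join(records), done))
        done.wait()                     # transaction blocks until its batch is durable

    def _writer(self):
        while True:
            batch = [self.q.get()]      # wait for the first committer...
            while not self.q.empty():   # ...then drain whatever else arrived
                batch.append(self.q.get())
            os.write(self.fd, b"".join(buf for buf, _ in batch))
            os.fsync(self.fd)           # one fsync covers every transaction in the batch
            for _, done in batch:
                done.set()
```

With many concurrent committers, the writer typically issues far fewer fsyncs than there are transactions, which is the whole point of the technique.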
Sauer et al. [12, 13] present a technique for instantly restoring a disk-resident database. This technique uses a partitioned index to write log records efficiently. After a crash, the recovery loads pages from a backup device and their corresponding log records from the indexed log. New transactions are allowed as soon as their necessary pages are restored. As a disadvantage, when the number of partitions increases, the system may have to search multiple partitions to retrieve a page, which might delay the recovery.

In [6], the authors provide a survey of techniques for implementing recovery in MMDBs. Besides, they describe the main features of the recovery mechanisms delivered by well-known MMDBs.

3 THE INSTANT RECOVERY MECHANISM

The existing MMDB recovery strategies use a sequential log, which makes instant recovery impossible. Recovery by a sequential log is not incremental and requires full recovery before any tuple can be accessed. This scenario does not allow the system to execute an on-demand transaction during recovery since the tuples required by the transaction can only be accessed after the recovery process has finished [2, 7, 12].

The instant recovery technique presented in this work builds the log file as an index structure. This log organization enables an efficient restoration of a tuple. A single fetch on the indexed log can restore one tuple. Thus, the system can use the indexed log to recover a database by restoring tuple by tuple incrementally. This technique naturally supports database availability since a new transaction can access a tuple immediately after the tuple is restored, i.e., transactions do not have to wait for a full recovery to access restored tuples. Figure 1 shows the architecture that implements the proposed MMDB instant recovery approach. The next subsections discuss the main components of the architecture and their interactions.

Figure 1: Architecture of the instant recovery mechanism.
3.1 The Logging Strategy

The proposed approach for MMDB instant recovery employs two log files: a sequential log (Figure 2(a)) and an indexed log (Figure 2(b)). Each record in the sequential log represents an update performed on a tuple by a transaction. During transaction processing, each transaction generates Redo records that are kept in a thread-local buffer. During the commitment, all log records generated by a transaction are appended atomically to the sequential log by the Logger component. The Log Sequence Number (LSN) represents the order in which a record was stored. This scheme ensures log consistency to recover the database.

The proposed recovery scheme requires efficient log reading to fetch the records needed to redo a given tuple during recovery. For this reason, the logging strategy implements an indexed log. The index structure is an extension of the B+-tree in which each node contains a tuple ID and the log records generated by transaction updates on that tuple. Therefore, the log granularity is the tuple. One probe on the B+-tree retrieves all the records needed to restore a single tuple. The Indexer component indexes records from the sequential log into the indexed log asynchronously to the transaction commit, i.e., a transaction does not need to wait for the log indexing to commit its execution. Records can be removed from the sequential log after they are indexed in the B+-tree. However, the sequential log is maintained to ensure consistent database recovery in the event of index corruption. In this case, the system must build a new indexed log from the sequential log. This process delays the start of the recovery process.
The primary purpose of instant recovery is to restore the database efficiently without degrading the transaction throughput provided by the system. The indexed log requires random writes, while a sequential log has a sequential write pattern. Writing records into a sequential log file is potentially faster than doing so into an indexed log file. For this reason, in our approach, log records are written to the sequential file first, since it provides efficient record writes. Periodically, the log records are flushed from the sequential log to the indexed log file. The indexing process occurs asynchronously to the transaction commit operation so as not to degrade the processing. Therefore, the proposed log organization can flush log records efficiently to a sequential log during transaction processing, and it needs only one fetch on the indexed log to retrieve all the log records required to restore one tuple.

Figure 2 illustrates the logging process. Observe that transactions Tx1, Tx2, and Tx3 generated log records for updates performed on tuples Tp1, Tp2, and Tp3. The sequential log stores the records flushed by the three transactions. The log records with LSN 11, 13, 16, and 19 represent the updates performed on tuple Tp1, for example. A fetch on the indexed log can retrieve the records with LSN 11, 13, 16, and 19 to redo the tuple Tp1. The absence of an index would imply a full scan of the sequential log to restore Tp1.

Figure 2: Sequential log (a), and indexed log (b).
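The interplay between the sequential log, the Indexer, and the indexed log can be illustrated with a minimal sketch. The names below are hypothetical, a plain dict of lists stands in for the B+-tree, and indexing is triggered by an explicit call rather than a background thread; the commit path never waits for it, mirroring the asynchronous design.

```python
class InstantRecoveryLog:
    """Sketch of the two-log organization (hypothetical names).

    Commits append to a sequential log (fast, append-only writes); an
    Indexer step later groups those records by tuple ID into an indexed
    log, here a dict of lists standing in for the B+-tree."""

    def __init__(self):
        self.sequential = []     # list of (lsn, tuple_id, value): append-only
        self.indexed = {}        # tuple_id -> [records]: stand-in for the B+-tree
        self.next_lsn = 1
        self.indexed_upto = 0    # how much of the sequential log is indexed

    def commit(self, updates):
        """Append a transaction's Redo records; no indexing on this path."""
        for tuple_id, value in updates:
            self.sequential.append((self.next_lsn, tuple_id, value))
            self.next_lsn += 1

    def index_pending(self):
        """Indexer step, asynchronous to commit in the real design."""
        for rec in self.sequential[self.indexed_upto:]:
            self.indexed.setdefault(rec[1], []).append(rec)
        self.indexed_upto = len(self.sequential)

    def records_for(self, tuple_id):
        """One probe retrieves every record needed to redo one tuple."""
        return self.indexed.get(tuple_id, [])
```

Echoing the Figure 2 scenario: after two commits touching Tp1, a single `records_for("Tp1")` lookup returns all of Tp1's records in LSN order, with no scan of the sequential log.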
3.2 The Checkpoint Strategy

During transaction processing, the Accesses Logger component stores tuple access information. The Checkpointer component uses this information to identify the most frequently accessed tuples. These tuples are associated with most of the generated log records. Periodically, the Checkpointer starts the checkpoint process, which is an extension of the fuzzy checkpoint approach. The checkpoint propagates the record actions in the indexed log of the most frequently used tuples.

For each tuple T, in which T is one of the most frequently accessed tuples, the Checkpointer retrieves the set S of log records contained in the node N of the B+-tree (in the indexed log) associated with T. The set S is able to redo the tuple T entirely in case of failure. Then, a new log record L, whose effect is equivalent to that of the set S to redo tuple T, is generated. Finally, the log record L replaces the set S on node N. This process is useful to decrease the number of log records to be processed during recovery. As a result, the checkpoint can accelerate the recovery process and liberate log space. The checkpoint is performed asynchronously to the transaction processing.
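The consolidation step, replacing the set S of records for a hot tuple with one equivalent record L, can be sketched as follows. The sketch assumes whole-tuple put-style Redo records (so replaying S reduces to keeping the value with the highest LSN); the function names and record layout are ours for illustration, not the paper's.

```python
def replay(records):
    """Redo a tuple by applying its log records in LSN order.
    Assumes whole-tuple 'put' records, so the last value wins."""
    value = None
    for _, _, v in sorted(records):      # records are (lsn, tuple_id, value)
        value = v
    return value

def checkpoint_tuple(indexed, tuple_id, next_lsn):
    """Replace the set S of records for one tuple with a single
    equivalent record L, shrinking the work of a future recovery."""
    s = indexed.get(tuple_id, [])
    if len(s) <= 1:
        return next_lsn                  # nothing to consolidate
    l = (next_lsn, tuple_id, replay(s))  # record L: same redo effect as S
    indexed[tuple_id] = [l]
    return next_lsn + 1
```

After the call, a recovery that probes this tuple processes one record instead of |S|, which is exactly how the checkpoint accelerates successive recoveries and frees log space.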
3.3 The Recovery Process

After a system failure, the system should initiate the database recovery by restoring tuples through the indexed log. However, the record indexing process is asynchronous to the transaction commit. As a result, some records in the sequential log may not have been indexed before a failure. Therefore, immediately before starting recovery, the system must verify whether any records have not yet been indexed. The Indexer component must index those records to ensure recovery consistency. When this process ends, recovery can begin and new transactions can be performed. The Restorer component then begins redoing tuples by traversing the indexed log B+-tree. Each visit to a B+-tree node can retrieve the update records to redo a tuple. When a tuple is redone, its key is marked as restored. After visiting all B+-tree nodes, all database tuples are restored, and the recovery process is completed.

The indexed log recovery scheme can naturally support availability since new transactions can be executed immediately after restoring their required tuples. Furthermore, this recovery scheme can service new transactions whose necessary tuples have not yet been loaded into memory during recovery. When a transaction requires tuples, the Scheduler component checks whether the tuples have already been restored. If they are not in memory, the Scheduler must request these tuples from the Restorer on demand. Then, the recovery manager should pause the incremental recovery (the traversal of the B+-tree) and begin fetching the necessary tuples for the transaction from the indexed log. After the transaction's tuples are restored, they are marked as restored, the transaction can run, and the system can continue the incremental recovery.
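A minimal sketch of the Restorer/Scheduler interplay described above, under hypothetical names and with a dict-based indexed log standing in for the B+-tree: a transaction arriving mid-recovery has just the tuples it needs restored first and then runs, while the bulk traversal can resume afterwards.

```python
class Restorer:
    """Sketch of incremental, on-demand redo (hypothetical names).

    recover_all() walks the indexed log tuple by tuple; restore() redoes
    one tuple from a single probe, so it can also serve on-demand requests
    that arrive while recovery is still in progress."""

    def __init__(self, indexed_log):
        self.indexed = indexed_log        # tuple_id -> [(lsn, tuple_id, value)]
        self.db = {}                      # in-memory database being rebuilt
        self.restored = set()

    def restore(self, tuple_id):
        if tuple_id not in self.restored and tuple_id in self.indexed:
            records = self.indexed[tuple_id]      # one probe per tuple
            self.db[tuple_id] = max(records)[2]   # highest LSN wins (put records)
            self.restored.add(tuple_id)

    def recover_all(self):
        for tuple_id in list(self.indexed):       # incremental traversal
            self.restore(tuple_id)

class Scheduler:
    """Checks/requests restoration of a transaction's tuples on demand."""

    def __init__(self, restorer):
        self.restorer = restorer

    def run_transaction(self, needed, txn):
        for tuple_id in needed:                   # pause bulk recovery and
            self.restorer.restore(tuple_id)       # fetch these tuples first
        return txn(self.restorer.db)
```

In this toy form, a transaction that needs only Tp2 can run as soon as Tp2 is redone, even though the rest of the database has not been recovered yet.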
During recovery, the checkpoint propagates log record actions similarly to the algorithm shown in Section 3.2. However, the propagation is applied to the log records of each restored tuple, rather than just to the most frequently used tuples. Therefore, after a tuple is restored, the Restorer requests the Checkpointer to perform a checkpoint for the log records of that tuple. The contents of the restored tuple are used to generate a new log record that replaces the log records in the B+-tree node used to redo the tuple. If successive failures occur, the next recovery will process fewer log records.

3.4 The Evaluation Prototype

The instant recovery approach proposed in this paper was implemented in Redis 5.0.7 [11] to evaluate the feasibility of indexing for log replay. The evaluation prototype can be downloaded at https://drive.google.com/drive/folders/1LTbtY36O0kWIpxZBM-hc1BPvIjICuy2F. Redis is an open-source in-memory data structure store used as an in-memory key-value database. Redis is written in ANSI C, has a simple architecture, and its source code is easy to understand. These characteristics facilitated the development of the prototype.

Redis uses an append-only file (AOF) to write log records and generates snapshots at regular intervals as a binary dump using the Redis RDB Dump File Format. Redis can automatically rewrite the AOF in the background in case its size exceeds the optimal size [11]. The prototype uses the AOF from Redis, i.e., we did not need to implement a sequential log. The RDB was disabled in the prototype. Moreover, the system does not rewrite the log.

The B+-tree of the indexed log was implemented using Berkeley DB 4.8 [9]. Berkeley Database (Berkeley DB or BDB) is a software library intended to provide a high-performance embedded database for key/value data. BDB allows the specification of the underlying organization of the data in various database implementations (e.g., b-tree, hash, queue, and recno). The checkpoint component proposed in this work is still under development. Thus, the experiments presented in the next section do not include this module.
4 RECOVERY MECHANISM EVALUATION

We have empirically evaluated the proposed instant recovery mechanism in order to demonstrate its efficiency and its suitability to be implemented in MMDBs. All experiments were executed with 4 worker threads on an Intel Core i7-9700K CPU at 3.60GHz x 8. The system has 64GB of RAM and a 400GB Kingston SA400S37 SSD as the persistent storage device. The operating system was Ubuntu Linux 18.04.2 LTS. We used the Memtier benchmark [5, 17] to perform the tests. Memtier is a high-throughput benchmarking tool for Redis developed by Redis Labs. The tool offers options to generate various workload patterns. Memtier has already been used in several scientific works, such as [17], for example.

4.1 Preliminary Experiments

The experiments focused on measuring the time to fully recover a database, the availability to process transactions after a system failure, the time needed to run a workload entirely, and the logging overhead. These experiments were performed on a database containing 99,507 keys that generated an 11.8GB sequential log file containing 160 million records. Additionally, an indexed log was generated along with this sequential log using the recovery technique proposed in this work. For each experiment, the system was shut down to simulate a failure. At the database restart, as soon as the recovery process was triggered, a workload would be submitted. Thus, one could measure transaction throughput and recovery time from system restart.

The key goal was to compare the proposed instant recovery approach to the traditional MMDB recovery. However, we also tested our instant recovery scheme in different scenarios to confirm the following expectations about our technique: (1) an indexed log must be employed to recover the database incrementally and on demand, and (2) the asynchronous indexing of log records must be adopted to avoid transaction processing overhead. Thus, the experiments have been conducted in the three following scenarios: (i) Sequential Log Recovery (SLR); (ii) Asynchronous Indexed Log Instant Recovery (AILIR); and (iii) Synchronous Indexed Log Instant Recovery (SILIR).
The SLR scenario (traditional recovery) uses only a sequential log. In this scenario, transaction update records are written to a sequential log file during transaction processing. The recovery process recovers the database by scanning the entire log file. The AILIR scenario (our approach) uses a sequential log plus an indexed log. In AILIR, transaction update records are written to a sequential log during transaction processing and stored in an indexed log asynchronously to the transaction commit. The SILIR scenario (which is derived from AILIR) uses only an indexed log. In SILIR, transaction update records are written directly to an indexed log synchronously to the transaction commit. After a failure, for both scenarios ii (AILIR) and iii (SILIR), the recovery manager must traverse the B+-tree to recover the database. The SILIR scenario was created to measure the log indexing overhead during transaction processing and instant recovery processing.

For each scenario, the experiments were performed on workloads with a 5:5 ratio between read and write operations. They were simulated using the Memtier benchmark, which operated 4 worker threads, with each thread driving 50 clients. Each client made 170,000 requests in a random pattern.
The results of the recovery experiments are shown in Figure 3. The vertical dashed lines in the figure indicate the final recovery time of the approach with the respective color. AILIR did not overload the transaction throughput since its throughput was similar to that of the default approach (SLR). This result was already expected because AILIR and SLR flush log records to secondary memory in a similar manner, except that AILIR additionally indexes the log records. However, the indexing did not interfere with the transaction throughput because it is performed asynchronously to the transaction commit. SILIR had the worst performance due to its synchronous log indexing, i.e., a transaction must wait for the indexing to confirm its writes. Although SLR recovered the database before AILIR, AILIR was able to perform transactions from the system restart onward and was the fastest approach to finish the workload execution. This result was achieved because AILIR has asynchronous indexing and can process transactions while the system is recovering. Additionally, the client application did not notice the AILIR recovery, giving the impression that the recovery was instantaneous.

Figure 3: Experiment results.

5 CONCLUSION

This paper proposed an instant recovery approach for MMDBs. The approach allows new transactions to run concurrently with the recovery process. The approach takes benefit of using a log file with a B+-tree structure. Thus, the recovery mechanism is able to seek tuples directly in the log file to rebuild the database in an on-demand and incremental fashion. New transactions are scheduled as soon as the required tuples are restored into the MMDB.

The results show that instant recovery reduces the perceived time to repair the database, seeing that transactions can be performed as soon as the system is restarted. In other words, it can effectively deliver the tuples that new transactions need during the recovery process. The experiments also analyzed the impact of using an indexed log structure on transaction throughput rates in an OLTP workload benchmark. We believe that adding a checkpoint module to the prototype developed in this work will increase system availability and provide faster database recovery.
REFERENCES

[1] Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Ake Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. 2013. Hekaton: SQL Server's memory-optimized OLTP engine. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 1243–1254.
[2] Franz Faerber, Alfons Kemper, Per-Åke Larson, Justin Levandoski, Thomas Neumann, Andrew Pavlo, et al. 2017. Main Memory Database Systems. Foundations and Trends in Databases 8, 1-2 (2017), 1–130.
[3] Franz Färber, Sang Kyun Cha, Jürgen Primsch, Christof Bornhövd, Stefan Sigg, and Wolfgang Lehner. 2012. SAP HANA database: data management for modern business applications. ACM SIGMOD Record 40, 4 (2012), 45–51.
[4] Florian Funke, Alfons Kemper, Tobias Mühlbauer, Thomas Neumann, and Viktor Leis. 2014. HyPer Beyond Software: Exploiting Modern Hardware for Main-Memory Database Systems. Datenbank-Spektrum 14, 3 (2014), 173–181.
[5] Redis Labs. 2020. Redis Labs | The Best Redis Experience. Retrieved October 06, 2020 from https://redislabs.com
[6] Arlino Magalhaes, Jose Maria Monteiro, and Angelo Brayner. 2021. Main Memory Database Recovery: A Survey. ACM Computing Surveys (CSUR) 54, 2 (2021), 1–36.
[7] Nirmesh Malviya, Ariel Weisberg, Samuel Madden, and Michael Stonebraker. 2014. Rethinking main memory OLTP recovery. In Data Engineering (ICDE), 2014 IEEE 30th International Conference on. IEEE, 604–615.
[8] C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. 1992. ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems (TODS) 17, 1 (1992), 94–162.
[9] Michael A. Olson, Keith Bostic, and Margo I. Seltzer. 1999. Berkeley DB. In USENIX Annual Technical Conference, FREENIX Track. 183–191.
[10] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. 1996. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 4 (1996), 351–385.
[11] Redis. 2020. Redis. Retrieved August 26, 2020 from https://redis.io
[12] Caetano Sauer, Goetz Graefe, and Theo Härder. 2017. Instant restore after a media failure. In Advances in Databases and Information Systems. Springer, 311–325.
[13] Caetano Sauer, Goetz Graefe, and Theo Härder. 2018. FineLine: log-structured transactional storage and recovery. Proceedings of the VLDB Endowment 11, 13 (2018), 2249–2262.
[14] Michael Stonebraker and Ariel Weisberg. 2013. The VoltDB Main Memory DBMS. IEEE Data Eng. Bull. 36, 2 (2013), 21–27.
[15] Yingjun Wu, Wentian Guo, Chee-Yong Chan, and Kian-Lee Tan. 2017. Fast Failure Recovery for Main-Memory DBMSs on Multicores. In Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 267–281.
[16] Chang Yao, Divyakant Agrawal, Gang Chen, Beng Chin Ooi, and Sai Wu. 2016. Adaptive logging: Optimizing logging and recovery costs in distributed in-memory databases. In Proceedings of the 2016 International Conference on Management of Data. ACM, 1119–1134.
[17] Yiying Zhang and Steven Swanson. 2015. A study of application performance with non-volatile main memory. In 2015 31st Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 1–10.
[18] Wenting Zheng, Stephen Tu, Eddie Kohler, and Barbara Liskov. 2014. Fast Databases with Fast Durability and Recovery Through Multicore Parallelism. In OSDI, Vol. 14. 465–477.