Merkle Hash Tree based Techniques for Data Integrity of Outsourced Data

Muhammad Saqib Niaz, Dept. of Computer Science, Otto von Guericke University, Magdeburg, Germany, saqib@iti.cs.uni-magdeburg.de
Gunter Saake, Dept. of Computer Science, Otto von Guericke University, Magdeburg, Germany, gunter.saake@ovgu.de

ABSTRACT
One of the problems associated with outsourcing data to cloud service providers is the data integrity of the outsourced data. In this paper we present data integrity techniques for outsourced data. Data integrity encompasses the completeness, correctness and freshness of the data. This paper focuses on Merkle Hash Tree based data integrity techniques. It also presents techniques for the storage and retrieval of Merkle Hash Tree based authentication data to and from a cloud data service provider. A critical analysis of Radix Path Identifiers, a technique for the storage of Merkle Hash Trees in databases, is also presented.

General Terms
Cloud Databases, Security

Keywords
Database Security, Data Integrity, Outsourced Data

1. INTRODUCTION
Data outsourcing means storing your data with third party cloud data service providers. It is cheaper and easier to maintain data on a cloud data service than to maintain it on the data owner's own premises. Besides all these benefits, data outsourcing poses numerous security threats to the outsourced data. The list includes, but is not limited to, loss of data integrity, access privacy and unauthorized access to data. The focus of this paper is data integrity, which encompasses completeness, correctness and freshness.

There are three parties involved in these schemes: the data owner (DO), the data clients and the data service provider (DSP). A DSP provides all the data services and can be trusted with server availability, timely backups, replication and disaster recovery. However, the DSP cannot be trusted with the integrity of the outsourced data: a DSP has unlimited access to the data, which makes it possible for the DSP to forge the data in any way. It is assumed that the link between the DO and the DSP, as well as the links between the clients and the DSP, are secured using a technique such as SSL, so that forgery of the data over these links is easily detectable.

The following are some of the features that need to be considered in designing data integrity techniques for outsourced data:

• Computation overhead for the DO
• Computation overhead of the DSP
• Storage overhead of the DSP
• Computation overhead of the client
• Storage overhead of the client

The rest of the paper is organized as follows. A basic data integrity technique is presented in Section 2. The Merkle Hash Tree based data integrity scheme is presented in Section 3. Section 4 explains the storage and retrieval technique for Merkle Hash Tree based authentication data. Section 5 presents our analysis of the Radix Path Identifiers technique and the ongoing work on a new technique. Finally, the conclusions follow in Section 6.

2. BASIC TECHNIQUE
For the rest of the paper, we assume that a mechanism is in place to securely share some data between the DO and the clients. This data could be the public key of the DO or some hash data. Only the DO can modify the data, and the clients have read-only access to the data at the DSP.

The simplest data integrity technique is to individually sign all the tuples in a data table and store the signatures in a separate column of the same table. On a client's query, the signature is then sent to the client along with the tuple data. The client can check the integrity of the data by verifying the DO's signature on the associated tuple. This scheme imposes a huge computation overhead on the DO and the clients. Besides the computational overhead, it has a linear storage overhead, as a distinct signature needs to be stored with each tuple. Still, attacks are possible on this scheme.
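The per-tuple signing scheme can be sketched as follows. This is a minimal illustration only: a real deployment would use the DO's public-key signature (e.g., RSA), whereas here an HMAC with a shared key stands in for it, and the row encoding and helper names are ours, not part of any described system.

```python
import hashlib
import hmac

SIGNING_KEY = b"do-secret"  # stand-in for the DO's private signing key

def canonical(row):
    # Deterministic byte encoding of a tuple (ID, Name, Salary).
    return "|".join(str(v) for v in row).encode()

def sign_tuple(row):
    # DO side: one signature per tuple, stored in an extra column.
    return hmac.new(SIGNING_KEY, canonical(row), hashlib.sha256).hexdigest()

def verify_tuple(row, signature):
    # Client side: recompute the signature and compare.
    return hmac.compare_digest(sign_tuple(row), signature)

# The DSP stores (row, signature) pairs and returns both on a query.
table = [(10, "Alice", 1000), (20, "Bob", 2000)]
signed = [(row, sign_tuple(row)) for row in table]

row, sig = signed[0]
assert verify_tuple(row, sig)                      # genuine tuple passes
assert not verify_tuple((10, "Alice", 9999), sig)  # forged salary fails
```

Note that the sketch also exhibits the scheme's stated costs: one signature computation per tuple and one stored signature per tuple (linear storage overhead).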
The DSP can delete some valid tuples from the data, and the client would never be able to establish this fact. The DSP can also send an incomplete data set to the client, and this forgery too will go undetected at the client's end.

Data integrity schemes can be divided into two main categories, i.e., probabilistic and deterministic. Probabilistic approaches for data integrity have been suggested in [7, 11, 12]. These techniques do not require any changes at the DSP's end but, as the name suggests, the integrity results can sometimes be doubtful. The second category consists of deterministic approaches that are generally based on Authenticated Data Structures (ADS) [6, 5, 10, 1, 2]. ADS based schemes will be the focus of the rest of the paper.

27th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 26.05.2015 - 29.05.2015, Magdeburg, Germany. Copyright is held by the author/owner(s).

3. MERKLE HASH TREES
Authenticated Data Structures (ADS) is a technique in which some kind of authentication data is stored on the DSP. On a client's query, the DSP returns the queried data along with some extra authentication data, which the client then uses to verify the authenticity of the returned data.

Numerous techniques have been proposed that utilize ADS for checking data integrity. Signature aggregation based schemes have been proposed in [6, 5]. These approaches require modifying the signatures of all the records, which renders them impractical considering the number of signatures [10]. An authenticated skip list based approach has been proposed in [1]. A modified Merkle Hash Tree (MHT) based scheme, named the super-efficient data integrity scheme, has been proposed in [2]. In this scheme the main MHT is divided into smaller MHTs and the root hashes of these sub-trees are signed. The purpose of the division into smaller MHTs is to avoid the unnecessary calculation up to the root of the main hash tree.

The Merkle Hash Tree based data integrity techniques for outsourced data are based on a signature scheme proposed by Merkle in [4]. This scheme eliminated the need for digital signatures for data integrity purposes. MHT based data integrity techniques build on two sub-components, i.e., Merkle's Signature Scheme and B+ trees.

3.1 Merkle's Signature Scheme
Merkle proposed a signature scheme based on a binary tree of hashes in [4]. Figure 1 shows a typical example of an MHT.

Figure 1: Merkle Hash Tree

Each leaf node holds the hash of a data block, e.g., H(1) holds the hash of data block 1. Internal nodes hold the hash of the concatenated hashes of their children, e.g., H(1,2) = H(H(1) | H(2)), where '|' indicates concatenation. This scheme is based on the assumption that a safe/trusted way exists to share the root of the tree between the signer and the verifier. To verify the integrity of any data block, the whole tree of hashes does not need to be transmitted to the verifier. The signer transmits the hashes of only those nodes which are involved in the authentication path of the data block under consideration. For example, if the receiver needs to verify the integrity of data block 2, then only H(1), H(3,4) and H(5,8) need to be transferred to the receiver. The receiver can calculate H(2) from data block 2. H(1,2) can then be calculated using the received H(1) and the calculated H(2). In the same way, H(1,4) can be calculated and then H(1,8). The receiver can then compare the calculated H(1,8) with the already shared H'(1,8); if both hashes match, the integrity of data block 2 is confirmed.

Some important facts regarding Merkle's Signature Scheme are as follows:

• The security of this signature scheme depends on the security of the hash function.
• Only one hash needs to be maintained/shared securely.
• To authenticate any data block, only log2 n hashes need to be transferred, where n denotes the total number of data blocks.
• In case of integrity checking of a continuous range of blocks, even fewer than log2 n hashes need to be transferred.

3.2 B+ Trees
B+ trees are a special case of B trees, as shown in Figure 2.

Figure 2: B+ Tree of order 3

They are n-ary trees. The root node can be a leaf node or an internal node. Internal nodes only hold keys; they do not hold data. Data always stays in the leaf nodes. Leaf nodes are connected through pointers to form a kind of linked list. This linkage helps in sequential traversal of the data. Let n be the order of a B+ tree. The root node can hold 1 to n-1 keys when it is the only node in the tree. If the root node is an internal node, it can have 2 to n child nodes. Internal nodes can have ⌈n/2⌉ to n child nodes. Leaf nodes can hold ⌈n/2⌉ to n-1 keys [8].

3.3 Data Integrity based on Merkle Hash Trees
Data integrity schemes based on MHTs have been designed by replacing the binary tree of the original Merkle's Signature Scheme with a B+ tree. The B+ tree presented in Figure 2 is used with some modifications. Leaf nodes are linked with direct pointers. Besides keys, leaf nodes also hold the hashes of the data records pointed to by the corresponding keys. As an example, the leaf node (20, 30) also holds the hashes of the data records of keys 20 and 30. The pointers of internal nodes likewise hold the hashes of the concatenated hashes of their child nodes; e.g., the right pointer of the internal node 20 holds the hash of the concatenated hashes of the data records pointed to by keys 20 and 30, i.e., H(20,30) = H(H(20) | H(30)).

The security of Merkle Hash Tree based data integrity schemes depends on the security of the hash function, as in the original Merkle's signature scheme. This scheme also resolves the freshness issue of query results. Each time the DO updates the data at the DSP, a new root hash is calculated based on the newly updated state of the data. By sharing the new root hash with the clients, freshness can be ensured.

3.4 Implementation Issues
A problem associated with MHT based data integrity schemes is the efficient storage and retrieval of MHTs in the DSP's database. Numerous approaches exist to store hierarchical or tree-like data in a database, e.g., adjacency list, nested set, nested interval and closure table. Each approach has its pros and cons. For this specific problem of storing MHTs, a new technique named Radix Path Identifiers has been proposed in [10].

4. RADIX PATH IDENTIFIER
Consider a data table named Employee, as shown in Table 1.

Table 1: Employee Data Table
ID  Name   Salary
10  Alice  1000
20  Bob    2000
30  Cindy  3000
40  Dan    3000
50  Eva    2000
60  Felix  1000

A Merkle Hash Tree is created based on the data in the Employee table, as shown in Figure 3. Data is inserted into the MHT in ascending order. The fanout of this B+ tree is three, which is why every node holds either one or two keys.

Figure 3: Merkle Hash Tree based on Table 1

The basic idea is to assign numbers based on a radix to each pointer of an internal node and each key of a leaf node in order to uniquely identify them in the MHT. The radix can be any number equal to or greater than the fanout of the MHT. We take 3 as the radix for the MHT created for the Employee table. Radix Path Identifiers have been added to the MHT shown in Figure 3, and the modified MHT is shown in Figure 4.

Figure 4: MHT with Radix Path Identifiers

The Radix Path Identifier of a pointer or key (in a leaf node) depends on its level in the MHT and its position in a node. Let l be the level in the MHT; the level of the root node is 0 and the level of the leaf nodes is the maximum. r_b is the radix base, f denotes the fanout of the MHT, and i denotes the index of a pointer or key in a node, ranging from 0 to f. The Radix Path Identifier rpi can be computed using the following equation:

    rpi = i                         if l = 0
    rpi = rpi_parent * r_b + i      if l > 0        (1)

For example, to calculate the RPI of the key 60 in its leaf node, the level of the key is determined first. The level is not zero, so the lower part of the equation applies; note also that all the calculations are done in the ternary number system. Here i is 1, as 60 is the second key in the leaf node, the RPI of the parent is 12, and r_b is 3. Multiplying rpi_parent by r_b gives 120, and adding i gives 121, so the RPI of the key 60 in the leaf node is 121.

The proposed Radix Path Identifier scheme has several important properties:

1. RPIs are continuous within a node, but not continuous across two consecutive nodes. For example, the base-3 numbers 10, 11, 12 are continuous, but 110 and 120 are not, as shown in Figure 4.
2. From an RPI, we can easily find the RPI of its parent pointer, based on the fact that rpi_parent = ⌊rpi / r_b⌋.
3. From an RPI in a node, we can easily calculate the minimum and maximum RPIs in that node, which are ⌊rpi / r_b⌋ * r_b and ⌊rpi / r_b⌋ * r_b + (r_b - 1).
4. From an RPI in a node, we can easily compute the index i of the pointer or key in the node, which is rpi mod r_b.

4.1 MHT Storage in the Database
Two models have been suggested in [10] for the storage of Radix Path Identifier based MHTs in the database. The first method is to store the whole authentication data in one table, called the Single Authentication Table (SAT). The second method is to store each level of the MHT in an individual table, called a Level Based Authentication Table (LBAT).

4.1.1 Single Authentication Table
In this technique, one table holds the entire authentication data, as shown in Table 2. A tuple in this table represents either a pointer in an internal node or a key in a leaf node of the MHT. The authentication table has four columns, named ID, RPI, Hash and Level. The ID column in the authentication table corresponds to the values in the ID column of the Employee table. In the case of leaf nodes, each key corresponds to a tuple in the Employee table, so the mapping is straightforward. However, in the case of internal nodes, the number of pointers is always one more than the number of keys. Treating the leftmost pointer in each internal node as this extra pointer, -1 is assigned in the ID column for these pointers; the rest of the pointers are stored with the IDs of the corresponding keys in the node. The Hash column holds the hashes associated with the pointers in internal nodes and the keys in leaf nodes. The RPI column holds the Radix Path Identifier of the pointer in an internal node or of the key in a leaf node.

Table 2: Single Authentication Table (SAT)
ID   RPI  Hash  Level
-1    0   hash   2
30    1   hash   2
-1    0   hash   1
20    1   hash   1
-1    3   hash   1
40    4   hash   1
50    5   hash   1
10    0   hash   0
20    3   hash   0
30    9   hash   0
40   12   hash   0
50   15   hash   0
60   16   hash   0

The RPIs shown in Figure 4 are unique because these numbers are written in base 3 with their preceding zeros. However, for storage in the authentication table, preceding zeros are ignored and the RPIs are converted into base-10 numbers. This results in different base-3 numbers mapping to the same base-10 number. For example, 011 in the third level and 11 in the second level transform to the same base-10 number, i.e., 4. Consequently, the transformed RPIs are unique within a level but can be repeated among different levels of the tree. In order to distinguish between equal RPIs, a Level column is added to the authentication table.

4.1.2 Level Based Authentication Table
In the LBAT technique, the authentication data for each level of an MHT is stored in an individual table. Besides this, an extra table is created that holds the data about the authentication tables, i.e., the name of each table and its associated level. The LBATs for the MHT shown in Figure 4 are shown in Table 3. As every LBAT represents one level of the tree, there is no need for a Level column in these authentication tables.

Table 3: Level Based Authentication Tables (LBAT)

Emp_2 (Root)
ID   RPI  Hash
-1    0   hash
30    1   hash

Emp_1
ID   RPI  Hash
-1    0   hash
20    1   hash
-1    3   hash
40    4   hash
50    5   hash

Employee (Leaf Nodes)
ID  Name   Salary  RPI  Hash
10  Alice  1000     0   hash
20  Bob    2000     3   hash
30  Cindy  3000     9   hash
40  Dan    3000    12   hash
50  Eva    2000    15   hash
60  Felix  1000    16   hash

Authentication data extracted from the LBATs is used to compute the root hash of the MHT. For data extraction from the LBATs, four different methods have been presented in [9]: Multi-Join, Single-Join, Zero-Join and Range-Condition. In all of the following methods, data is extracted from the LBATs to verify the authenticity of the data record with ID 40. All the data involved in the authentication path of ID 40 needs to be extracted from the LBATs; the pointers involved in this authentication path are marked in black in Figure 4.

4.2.1 Multi-Join
In the Multi-Join approach, all the authentication data from the respective LBATs is retrieved in a single query. The following SQL statement retrieves the authentication data of the record with ID 40. In order to fetch all the authentication data in one query, multiple left outer joins are used, which introduces redundancy in the result.

select a0.RPI as RPI0, a0.hash as hash0,
       a1.RPI as RPI1, a1.hash as hash1,
       a2.RPI as RPI2, a2.hash as hash2
from Employee emp
left join Employee a0 on a0.RPI/3 = emp.RPI/3
left join Emp_1 a1 on a1.RPI/3 = emp.RPI/(3*3)
left join Emp_2 a2 on a2.RPI/3 = emp.RPI/(3*3*3)
where emp.ID = 40;
The Level 0 authentication table has exactly the same number of records as the Employee table, so both tables can be merged into one: RPI and Hash columns are added at the end of the Employee table to hold the authentication data for Level 0.

4.1.3 Performance comparison between both schemes
Considering the table-level locks during updates and inserts, it is easier/faster to update authentication data in an LBAT than in the SAT. In the LBAT scheme, as the leaf-level authentication data is stored along with the data records, it is straightforward to retrieve the authentication data for the leaf level along with the required table data.

4.2 Authentication Data Extraction

4.2.2 Single-Join
In the Single-Join approach, the data from each authentication table is retrieved separately. As an ID column does not exist in the authentication tables, in each query the authentication table is joined with the Employee table on the RPI column.

select e0.RPI, e0.hash
from Employee emp
left outer join Employee e0 on e0.RPI/3 = emp.RPI/3
where ID = 40;

select e1.RPI, e1.hash
from Employee emp
left outer join Emp_1 e1 on e1.RPI/3 = emp.RPI/(3*3)
where emp.ID = 40;

select e2.RPI, e2.hash
from Employee emp
left outer join Emp_2 e2 on e2.RPI/3 = emp.RPI/(3*3*3)
where emp.ID = 40;

4.2.3 Zero-Join
As is evident from the name, no tables are joined for querying the authentication data; each table is queried individually. To query each table without any joins, the RPI of the record under consideration has to be extracted first and stored in a variable. Afterwards, this stored RPI is used to extract the authentication data from the LBATs.

declare @rpid as int;
select @rpid = RPI from Employee where ID = 40;

select RPI, hash from Employee where RPI/3=@rpid/3;

select RPI, hash from Emp_1 where RPI/3=@rpid/(3*3);

select RPI, hash from Emp_2 where RPI/3=@rpid/(3*3*3);

4.2.4 Range-Condition
Execution of the queries presented in the Zero-Join section scans the RPI column for each query. This scan can be replaced with an index seek by creating an index on RPI and replacing the Zero-Join queries with range conditions. The following queries show how to utilize an index created on RPI to efficiently query the data from the LBATs for the authentication of the data record with ID 40.

declare @rpid as int;
select @rpid = RPI from Employee where ID = 40;

select RPI, hash from Employee
where RPI >= (@rpid/3)*3
and RPI < (@rpid/3)*3+3;

select RPI, hash from Emp_1
where RPI >= (@rpid/(3*3))*3
and RPI < (@rpid/(3*3))*3+3;

select RPI, hash from Emp_2
where RPI >= (@rpid/(3*3*3))*3
and RPI < (@rpid/(3*3*3))*3+3;

Each of the above queries retrieves authentication data from one specific level of the MHT. The range condition in each query encompasses all the RPIs of the elements present in a node. For example, at level three a node can have the RPIs 120, 121 and 122. The last digit of an RPI always stays less than the fanout of the tree, which is why 3 is used as the offset for the exclusive upper bound in the queries.

4.3 Data Operations
Four data operations, i.e., select, insert, update and delete, will be discussed.

4.3.1 Select
Selection of a single record along with its associated authentication data has already been discussed in detail. Authentication of a continuous range of records is a bit different from the authentication of a single record. Suppose our range query returns the set of records with IDs 20 to 40. The user needs to find the two bounding values of the range under consideration; in this case, the two bounding values are 10 and 50. The user just needs to fetch the range of records from ID 20 to 40 without their authentication data. In order to authenticate this range, only the authentication data of the two bounding values, i.e., 10 and 50, is required. By verifying the data integrity of the bounding values, the user can be assured of the data integrity of the whole range.

4.3.2 Update
Update is a more complicated operation than select. When updating a record, along with updating the data table, the user also needs to update the hashes in all the records involved in the authentication path of the updated record. For example, if the user updates the data record with ID 60, the hash of this record is updated in the Employee table along with the data update. In addition, the user needs to update the hash data in the pointers marked 12 and 1, as shown in Figure 5.

Figure 5: Updating a Record

4.3.3 Insert & Delete
Insert and delete are more complicated operations than updating a record. They can affect the associated MHT in three different ways. The simplest case is that the insertion or deletion of a record affects only a single leaf, i.e., a key is added to or deleted from a leaf node; in this case only the data involved in the authentication path of the affected leaf node needs to be updated. A little more complicated case is a change in a subtree of the MHT; in this case all the authentication records of that subtree need to be updated, and in addition the authentication path of the root of the updated subtree also needs to be updated. The most complex case is the addition or deletion of a level of the MHT. In the case of the addition of a new level, the following needs to be done:

1. Addition of a new LBAT table in the database for the newly inserted level of the MHT.
2. Insertion of the information regarding the new LBAT table into the table that holds the data about the LBATs.
3. Update of the data in all the LBATs.

5. ANALYSIS & ONGOING WORK

5.1 Analysis
In this section, we analyze the properties of Radix Path Identifiers and identify our next steps based on them. The Merkle Hash Tree based data integrity technique guarantees all three aspects of data integrity, i.e., completeness, correctness and freshness [3]. In doing so, it completely avoids digital signatures, which pose a large computation overhead. Our analysis of the Radix Path Identifier technique is as follows:

• It is assumed that only the DO can change the data.
• All the tests are performed on a traditional DBMS, i.e., SQL Server [10]. NoSQL databases may perform differently.
• The cached technique for updates results in the lowest overhead. Without caching of data at the DO's end, the overhead can go up to 100%.
• Insert at the end of the table gives better results than insert at the beginning of the table; both cases pose significant overhead. No results have been published for delete.
• In order to modify the data, the DO either has to download a whole copy of the table along with the authentication data, or has to keep the whole data cached at the DO's premises.

5.2 Ongoing Work
We are currently starting to work on a data integrity technique that is based on Merkle Hash Trees and Radix Path Identifiers. We want to achieve the following goals in this new technique:

• Multiple users should be able to manipulate the data.
• A user should not be required to keep a copy of the data in order to modify it, because keeping a copy of the data defeats the purpose of data outsourcing.
• The communication overhead should be reduced to a level close to that of performing operations on a normal database. For instance, to insert a row of data, a single insert statement should be sent to the database.
• The technique is being designed keeping NoSQL database concepts in view, too.

6. CONCLUSIONS
Numerous techniques have been proposed in the literature to check the data integrity of data outsourced to untrusted clouds. The focus of our paper was on Merkle Hash Tree based techniques. Despite a lot of research on MHT based techniques, insertion and deletion still pose a lot of communication and computation overhead. We have also discussed a technique named Radix Path Identifiers for storing and retrieving authentication data in the DSP's database. We plan to design a new technique to eliminate the shortcomings of the current data integrity techniques. The main purpose of our technique is to avoid keeping or fetching a copy of the whole data in order to run an insert or update statement. We also aim to simplify the process of inserting and deleting data to match what we are used to in traditional DBMSs.

7. REFERENCES
[1] G. Di Battista and B. Palazzi. Authenticated relational tables and authenticated skip lists. In Proceedings of the 21st Annual IFIP WG 11.3 Working Conference on Data and Applications Security, pages 31-46, Berlin, Heidelberg, 2007. Springer-Verlag.
[2] M. T. Goodrich, R. Tamassia, and N. Triandopoulos. Super-efficient verification of dynamic outsourced databases. In Proceedings of the 2008 Cryptographers' Track at the RSA Conference on Topics in Cryptology, CT-RSA'08, pages 407-424, Berlin, Heidelberg, 2008. Springer-Verlag.
[3] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin. Dynamic authenticated index structures for outsourced databases. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD '06, pages 121-132, New York, NY, USA, 2006. ACM.
[4] R. C. Merkle. A certified digital signature. In Advances in Cryptology - CRYPTO '89, 9th Annual International Cryptology Conference, Santa Barbara, California, USA, August 20-24, 1989, Proceedings, pages 218-238, 1989.
[5] M. Narasimha and G. Tsudik. Authentication of outsourced databases using signature aggregation and chaining. In International Conference on Database Systems for Advanced Applications (DASFAA), pages 420-436, 2006.
[6] H. Pang, A. Jain, K. Ramamritham, and K.-L. Tan. Verifying completeness of relational query results in data publishing. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD '05, pages 407-418, New York, NY, USA, 2005. ACM.
[7] R. Sion. Query execution assurance for outsourced databases. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB '05, pages 601-612. VLDB Endowment, 2005.
[8] A. L. Tharp. File Organization and Processing. John Wiley & Sons, Inc., New York, NY, USA, 1988.
[9] W. Wei and T. Yu. Practical integrity assurance for big data processing deployed over open cloud, 2013.
[10] W. Wei and T. Yu. Integrity assurance for outsourced databases without DBMS modification. In Data and Applications Security and Privacy XXVIII - 28th Annual IFIP WG 11.3 Working Conference, DBSec 2014, Vienna, Austria, July 14-16, 2014, Proceedings, pages 1-16, 2014.
[11] M. Xie, H. Wang, J. Yin, and X. Meng. Integrity auditing of outsourced data. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 782-793. VLDB Endowment, 2007.
[12] M. Xie, H. Wang, J. Yin, and X. Meng. Providing freshness guarantees for outsourced databases. In Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology, EDBT '08, pages 323-332, New York, NY, USA, 2008. ACM.