Provenance Metadata Management in Distributed Storages Using the Hyperledger Blockchain Platform? Andrey Demichev1 , Julia Dubenskaya1 , Elena Fedotova1 , Alexander Kryukov1 , Stanislav Polyakov1 , and Nikolay Prikhod’ko2 1 Skobeltsyn Institute of Nuclear Physics Lomonosov Moscow State University, Moscow, Russia demichev@theory.sinp.msu.ru 2 Yaroslav-the-Wise Novgorod State University, Velikiy Novgorod, Russia niko2004x@mail.ru Abstract. The paper is devoted to the development of principles and operation algorithms of provenance metadata management system, en- titled ProvHL (Provenance HyperLedger), based on the integration of blockchain technology, smart contracts and provenance metadata driven data management. It is fault-tolerant and reliable in terms of the safety and security of provenance metadata records from accidental or inten- tional distortion. Keywords: distributed storage · provenance metadata · blockchain · access rights · Hyperledger Fabric. 1 Introduction This work is aimed at developing design principles of a system for storage and management of provenance metadata for data generated by large-scale scientific experiments. A vivid example of such installations is the Large Hadron Collider (CERN, Geneva; https://home.cern), the time of active work of which and, ac- cordingly, the generation of large scientific data, is several tens of years, and the processing time of the data will be at least twice as much. Without detailed and correct provenance metadata, comparing the results obtained with an interval, for example, of a few years, will be simply impossible. Another important ex- ample is astroparticle physics which has become a data intensive science with many terabytes of data and often with tens of measured parameters associated to each observation (see e.g., [1]). While 10–15 years ago there were 1–10 Tb of data per year in astrophysics, new experimental facilities generate data sets ranging in size from 100’s to 1000’s of terabytes per year. Moreover new highly complex and massively large datasets are expected to be produced in the next decades by novel and more complex scientific instruments as well as results of data simulations needed for physical interpretation. ? This work was funded by the Russian Science Foundation (grant No. 18-11-00075). Although a number of projects have been implemented in recent years to create systems for the support and management of metadata, including the data provenance, but the vast majority of implemented solutions are centralized [2, 3], which is inadequate for the use in distributed environments and the possibil- ity of using metadata by organizationally unrelated or loosely related research communities. The use of centralized solutions in a distributed environment that includes different administrative domains is always associated with the problems of organizing the management of the central service, as well as with ensuring the trust in such a service from the side of participants of the distributed system. In addition, any central service is a bottleneck and a point of failure of the system. On the other hand, in recent years, distributed registries based on blockchain technology [4] have become very popular in various applied areas thanks to a number of important advantages. Just recently, blockchain-based provenance metadata (PMD) management systems have also been developed [5, 6]. Analysis of the proposed solutions shows that they are designed for cloud storage environ- ments, are quite heavyweight and resource-consuming. The latter is related to the specific peculiarities of realization of the distributed registries as blockchains (namely, necessity for block mining). This makes doubtful the prospects for suc- cessful use of these solutions for the storage and management of provenance metadata generated by large scientific experiments in distributed environments. In this paper, we propose an approach to solving this problem based on the use of blockchain technology and smart contracts within the Hyperledger platform (https://www.hyperledger.org) [7]. The basic principles of operation and algorithms of the ProvHL system (Provenance HyperLedger) for managing provenance metadata and data access rights in distributed storages are pre- sented. 2 Basic principles of operation of the system for managing provenance metadata 2.1 Smart contracts The basic scenario of using the proposed system assumes that a virtual orga- nization (VO) is formed for the joint implementation of a certain project. VO includes several real organizations, in turn including data providers, users and data handlers affiliated with them. It is assumed that the implementation of such a project requires the use of a distributed data storage. This distributed storage can be formed by renting multiple cloud storage, as well as integrating the own storage resources of the organizations that form the VO. Thus, the hardware and software basis of the business environment in this case is formed by a set of storages (possibly of different types, e.g., cloud storages, file servers, tape storages, etc.), each of which can be managed by its own data management system (DMS). For certainty, it is further assumed that the data is stored as files, i.e. the file is a unit of data. Generally speaking, several VOs can coexist; the storages with which they interact can form partially overlapping sets. In such an environment, an immutable and distributed (as the environment itself) registry and a consensus on the order of data operations are needed to resolve possible conflicts between the VO/project participants related to the use of the data. Conflicts may be caused by priority issues upon obtaining results of data processing, use of results (including funding issues), interrelations with storage providers, etc. In order to prevent possible conflicts, accurate imple- mentation of mutual data access policies is required. In other words, support of business processes for data sharing and storage is needed. A smart contract along with the registry form the basis of a blockchain system. While the registry contains information about the current and historical state of a set of business objects, a smart contract determines the executable logic that generates new states to be added to the registry. Before parties of a business process can enter into interactions with each other, they must define a common set of contracts covering common terms, data, rules, concept definitions and processes. Taken together, these contracts define a business model that governs all interactions between transactional parties. A smart contract defines these rules between the parties in the form of executable code. 2.2 Permissioned blockchains A natural solution for the establishment of a distributed immutable registry for the provenance metadata (PMD) records is the use of the blockchain technology. The latter guarantees that no records were inserted into the registry in hindsight, no entries were changed in the registry and the registry has never been branched or bifurcated. An important question is how to provide validation of the chain of blocks with transaction records in the case of PMD registry. The use of the most popular proof-of-work (PoW) method [4] on the basis of mining is very resource-intensive, and is poorly suited for management systems for provenance metadata in the case of processing scientific data. Indeed, the calculations that are performed within the framework of PoW themselves do not serve any useful purpose, and this is a principle feature. It is very difficult to come up with a proof of work that would serve a socially useful role. Therefore, if possible, it is better to abandon it. Trying to solve these problems, a community of researchers in this field offers a variety of consensus algorithms that do not require “work”. The choice of the algorithm heavily depends on the way of access to transaction processing. From this point of view, blockchains are classified as follows: – permissionless (public) blockchains, in which there are no restrictions on the transaction handlers; – permissioned blockchains, in which transaction processing is performed by specified entities. Public blockchains are more known because cryptocurrency networks are based on them. In contrast to the permissionless blockchains, in the systems based on permissioned blockchains, the built-in coins are usually not used. Built-in coins are required in permissionless blockchains to provide a reward for pro- cessing transactions. Permissioned blockchains can form a more controlled and predictable environment than public blockchains and does not require calcula- tions related to the PoW algorithms. In the distributed storage environment, the local data management systems, data owners, representatives of real organi- zations participating in the project, etc., can act as the authorized parties that create and sign the blocks. In order to maliciously change a transaction con- firmed by all the authorized parties in the distributed storage environment, the attacker must gain access to all the secret keys of the block handlers. This is very unlikely, and thus this approach provides a high degree of protection for the distributed registry. It is this approach to the construction of the metadata registry that was implemented in our PMD management system. 2.3 Hyperledger blockchain platform To put this solution into practice, it is convenient to use existing blockchain platforms. Analysis of existing platforms shows that the required solution for the PMD management system most naturally can be implemented on the basis of the Hyperledger Fabric permissioned blockchain platform (HLF; hyperledger.org) [7] together with Hyperledger Composer (hyperledger.github.io/composer). The lat- ter is a set of tools to simplify the use of blockchain. Hereafter we shall refer to these two components as HLF&C-platform. To describe the business process within the framework of HLF&C-platform, a number of concepts are used, the main ones are assets, participants, transactions and events. Assets are tangible or intellectual resources, services or property, records of which are kept in the blockchain. Assets must have a unique identifier, but they can also contain any properties defined for them. Participants are members of the business network who can own assets and make transaction requests. They also can have any prop- erties if necessary. Transaction is the mechanism of interaction of participants with assets. Messages about the events can be sent by transaction processors to inform external components of changes in the blockchain. Very important that HLF&C-platform provides the operation of smart contracts (called chaincode), which allows us to organize the business process of sharing storage resources by project participants located in different administrative domains. The suggested ProvHL system for managing provenance metadata is a sophisticated adaptation of the HLF&C-platform for the business process of sharing storage resources. Unlike public blockchain networks, which allow non-authenticated users to participate in their work, members of the HLF&C-network must be registered with Membership Service Provider (MSP), which, among other things, performs the functions of Certification Authority (CA). 2.4 Management of data access rights In addition to the task of recording the immutable history of working with data in a distributed storage environment, the task of providing distributed management of access rights to data is set. For example, the owner of a data file (the user who created the data, the organization to which it belongs through its authorized representative, etc.) must be able to manage access to it for other users. Another example is when a cloud storage service grants access to data stored on it only to users from organizations that have paid for this storage service. Detailed management of rights to initiate transactions related to operations with data in the distributed storage is based on the use of special Hyperledger Composer tools. The rights have to be described in acl-files located in the nodes of the blockchain network with the help of the special Access Control Language (ACL). Modification of the contents of the acl-file is also carried out by initiating the corresponding transaction by duly authorized users. 2.5 Metadata driven data management From the general point of view, two approaches are possible. In the first ap- proach, data management systems (DMS) manage data and use a blockchain simply as a distributed log (data driven data management). In the second ap- proach, the metadata is written to the blockchain beforehand, and DMSs refer to the blockchain and perform the transactions recorded there (metadata driven data management). In the first case, the functionality of the blockchain system is very limited, it only provides a distributed ledger which is resistant to occa- sional or malicious attempts to modify the history of data in distributed stor- age. HLF&C-platform enables one to implement the second approach, which, in addition to simply maintaining the ledger, allows us to solve the problem of distributed data access management. In our case, participants (in the sense of the HLF&C-platform) include per- sons (users and administrators of different levels) and storage providers. The main assets are data files. Their properties (attributes) are provenance meta- data, including local file name in a storage, storage ID, creator ID, file owner ID, type of the file (primary, secondary or replica), etc. Another important type of the assets consists of the (local) storages constituting the distributed envi- ronment. We also defined user groups as assets, because we found it useful for managing data access rights. Finally, operations with files are treated as assets too because each operation actually comprises of a several atomic transactions. The basic operations can be of the following types: new file upload; file down- load; file copy within a storage; file deletion; file copy to another storage; file transfer to another storage. The algorithm which we propose for recording transactions with provenance metadata and managing data access rights in the framework of ProvHL in a very simplified form reads as follows: – the owner accesses the chaincode function, which, according to the acl-file (acl stands for access control language), allows the owner of the data to grant access rights to these data to another user or group of users; – a user who has been granted access rights by the owner accesses the chaincode with a request to make an operation (ClientRequest transaction) with data (for example, file download, upload or copy); – chaincode verifies that such a transaction complies with the rules defined in the acl-file and, if it does, sends a request to the HLF environment to complete the transaction; – HLF performs transaction processing (transaction workflow: simulation and endorsements → ordering → validation → state updating); – HLF sends a message (event) to the user about the successful transaction and its recording in the blockchain; the message also contains the transaction identification number; – the user accesses the data management system (DMS) with a request to perform a data operation that contains the number of the corresponding transaction; – DMS checks for a record of this transaction in the blockchain; – if there is a record of the valid transaction, the DMS performs the required operation and, in turn, initiates a transaction record confirming that a data operation was performed (ServerResponse transaction). As it can be seen, for each data operation, at least two transaction records are made in the blockchain: the first corresponds to the client request (ClientRe- quest), and the second corresponds to the server response (ServerResponse). In general case, an operation comprises of even more transactions. In the simplified description of the algorithm, some details specific to certain types of transactions are omitted for brevity. In particular, when the new file upload operation is per- formed, the creation of the new asset, that is the data file, is performed only after the actual upload of the file in the storage when DMS makes a ServerResponse transaction and turns the uploaded file into a fully valid asset. Together with the above-mentioned splitting of transactions into the client and server parts, this makes the level of correspondence between the history recorded in the blockchain and the real history of the data in the distributed storage practically acceptable. 2.6 Consensus One of the first and most well known consensus algorithms is the Paxos algo- rithm [8]. This algorithm is not designed to work in distributed systems with possible Byzantine errors (malicious distortion of information by nodes). It is very difficult for understanding and implementing. In addition, Paxos uses an approach in which each node (consensus member) interacts with each, so the complexity of the decision is O(n2 ), where n is the number of the nodes. As a result, practical implementations have little to do with Paxos. Each implemen- tation begins with Paxos, detects difficulties in its implementation, and then significantly changes the architecture. It takes a lot of time and leads to errors. Because of these problems, Paxos is not a good choice for building real systems. The Raft algorithm [9] implements the consensus by choosing a single leader, giving it full responsibility for managing the replicated log. The leader accepts requests from consensus members, copies them to other nodes, and tells the rest of the nodes when it is safe to use log entries in their replicated state machines. The idea of having a special leader simplifies the management of the replicated log. If the leader for some reason stops working, the procedure for selecting a new leader begins. However, Raft is also not designed to work in distributed systems in which the Byzantine type of error is possible. The Practical Byzantine Fault Tolerance (PBFT) algorithm [10] was the first practical solution to achieve consensus in the face of Byzantine failures. It uses the concept of replicated state machine, and nodes in a PBFT system are sequen- tially ordered with one node being the leader and others referred to as backup nodes. All nodes in the system communicate with one another with the goal being that all honest nodes will come to an agreement of the state of the system using the majority rule. This algorithm requires 3n + 1 replicas to be able to tolerate n failing nodes. Communication between nodes has two functions: nodes must prove that messages came from a specific peer node, and they must verify that the message was not modified during transmission. This approach imposes a low overhead on the performance of the HLF&C-platform ordering services which are consensus nodes in our case. However messaging overhead in the case of PBFT increases significantly as the number of the ordering nodes increase. However it remains acceptable for a couple of dozens of ordering service nodes (parties in a project using a distributed storage under the ProvHL management). Currently we consider PBFT as a most suitable distributed consensus algorithm. 3 Conclusion In this paper, using the novel approach based on the integration of blockchain technology, smart contracts and metadata driven data management, the prin- ciples and algorithms of the new system, entitled ProvHL (Provenance Hyper- Ledger), have been developed. This system is intended for fault-tolerant, safe and secure management of provenance metadata, as well as of access rights to data in distributed storages. The problems of optimal choice of the blockchain type for such a system, as well as the choice of the blockchain platform are stu- died. Namely, it is proposed to use the permissioned type of blockchain and the Hyperledger blockchain platform, on the basis of which the ProvHL system is implemented. At present, a testbed has been created on the basis of SINP MSU, where a preliminary version of the ProvHL prototype is deployed to implement the developed principles and refine the algorithms of the system. The creation of ProvHL production level system will significantly improve the quality and reli- ability of the results obtained on the basis of processing and analysis of data in a distributed computer environment. References 1. Berghöfer, T., et al.: Towards a model for computing in european astroparticle physics. arXiv preprint, arXiv:1512.00988 (2015) 2. Zafar F., et al.: Trustworthy Data: A Survey, Taxonomy and Future Trends of Secure Provenance Schemes. Journal of Network and Computer Applications 94, 50-68 (2017) 3. da Cruz S. M. S., Campos M. L. M. and Mattoso M.: Towards a Taxonomy of Provenance in Scientific Workflow Management Systems. In: World Conference on Services-I, pp. 259-266. IEEE (2009) 4. Baliga A.: Understanding Blockchain Consensus Models. Tech. rep., Persistent Systems Ltd (2017) 5. Ramachandran A. and Kantarcioglu M.: SmartProvenance: A Distributed, Blockchain Based Data Provenance System. In: CODASPY’18: The 8th ACM Con- ference on Data and Application Security and Privacy. Tempe, AZ, USA (2018) 6. Liang X. et al.: Provchain: A Blockchain-based Data Provenance Architecture in Cloud Environment with Enhanced Privacy and Availability. In: Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Com- puting, pp. 468-477. IEEE Press (2017) 7. Androulaki E.,et al.: Hyperledger Fabric: A Distributed Operating System for Per- missioned Blockchains. In: Proceedings of the Thirteenth EuroSys Conference, ar- ticle No. 30. ACM, Porto, Portugal (2018) 8. Lamport, L.: The Part-Time Parliament. ACM Transactions on Computer Systems 16(2) 133–169 (1998) 9. Ongaro, D. and Ousterhout J.K.: In search of an understandable consensus algo- rithm. In: USENIX Annual Technical Conference, pp. 305–319. USENIX Associa- tion (2014) 10. Castro M. and Liskov B.: Practical Byzantine Fault Tolerance. In: Proceedings of the 3rd Symposium on Operating Systems Design and Implementation, pp. 173-186 (1999)