=Paper= {{Paper |id=Vol-1848/CAiSE2017_Forum_Paper4 |storemode=property |title=EthDrive: A Peer-to-Peer Data Storage with Provenance |pdfUrl=https://ceur-ws.org/Vol-1848/CAiSE2017_Forum_Paper4.pdf |volume=Vol-1848 |authors=Xiao Liang Yu,Xiwei Xu,Bin Liu |dblpUrl=https://dblp.org/rec/conf/caise/YuXL17 }} ==EthDrive: A Peer-to-Peer Data Storage with Provenance== https://ceur-ws.org/Vol-1848/CAiSE2017_Forum_Paper4.pdf
EthDrive: A Peer-to-Peer Data Storage with Provenance

Xiao Liang Yu¹,²,³, Xiwei Xu¹,⁴, and Bin Liu¹,⁴

¹ Data61, CSIRO, Australia
² Carnegie Mellon University, Pittsburgh, USA
  mr.y.xiaoliang@ieee.org
³ The University of Melbourne, Melbourne, Australia
⁴ School of Computer Science and Engineering, UNSW, Sydney, Australia
  {firstname.lastname}@data61.csiro.au



       Abstract.    In this digitally connected world, cloud storage plays an
       important role in allowing users to store, share, and access their files
       anywhere. The conventional cloud storage is a centralised system that relies
       on a trusted party to provide all services. Thus, the cloud provider has
       the ability to manipulate the system including, for example, changing the
       user's data and upgrading the infrastructure software without informing
       the users. A centralised authority is also a single point of failure for
       the entire system from the software architecture perspective. In this
       paper, we demonstrate EthDrive, a peer-to-peer data storage that
       leverages blockchain to provide data provenance. A peer-to-peer
       architecture eliminates the centralised authority and achieves high
       availability since the data is normally replicated on multiple nodes within
       the network. The blockchain is used to provide a tamper-proof data
       provenance, which can be used to check data integrity. We use an IoT
       scenario to show the feasibility of EthDrive⁵.

       Keywords:    Information security · Data storage systems · Internet of
       Things · Distributed databases


1    Introduction
Cloud storage reduces the need for local storage and enables efficient file sharing.
The conventional cloud storage is a centralised system that relies on a trusted
party to provide all services. Such a centralised architecture has two problems.
First, data on cloud storage is readable, modifiable, and deletable by the cloud
provider. For example, on 31 Jan 2017, six hours of database data of the git
repository manager gitlab.com was deleted by a single delete command⁶. Second,
the availability of the data on cloud storage is questionable [5], while saving
copies to multiple cloud providers to increase data availability requires
considerable effort due to vendor lock-in [1].

⁵ The demo video for EthDrive is available at http://video.ethdrive.org/latest.
  The alpha version of EthDrive is now available at npm
  (http://dist.ethdrive.org/latest). The website for EthDrive is at http://ethdrive.org.
⁶ The announcement from gitlab.com can be accessed at
  https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/

X. Franch, J. Ralyté, R. Matulevičius, C. Salinesi, and R. Wieringa (Eds.):
CAiSE 2017 Forum and Doctoral Consortium Papers, pp. 25-32, 2017.
Copyright 2017 for this paper by its authors. Copying permitted for private and academic purposes.

    In this paper, we propose EthDrive to solve those problems. EthDrive is a
distributed storage client which allows users to upload, download, and share files.
It utilizes a content-addressable distributed file system as the underlying storage
and a blockchain as the distributed database. All files are stored in the distributed
storage while the file records are registered on the blockchain. Files are indexed
by a unique file ID which is associated with the content address, storage system
used, last uploader, last update time, authorised editors, authorised
administrators (people who can add and remove editors), immutability (whether the
file can be updated or deleted), and comments. That information is stored in
the blockchain via a smart contract, which is a program deployed and running
across the blockchain network [7]. Blockchain is an emerging technology that
allows participants in an industry ecosystem to transact with each other without
relying on a central trusted authority to record and validate transactions [8].
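
The fields associated with a file ID can be pictured as a small record type. The following Python sketch is only an illustration of the list above; the class names, field names, and sample values are assumptions and do not reproduce the actual layout of the EthDrive smart contract.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Comment:
    file_id: str   # the file ID the comment refers to
    author: str    # blockchain account of the comment author
    content: str


@dataclass
class FileRecord:
    """Hypothetical in-memory mirror of one on-chain file record."""
    file_id: str
    content_address: str      # address of the content in the distributed storage
    storage_system: str       # which storage system holds the content, e.g. "ipfs"
    last_uploader: str        # account that performed the last update
    last_update_time: int     # time of the last update, seconds since the epoch
    editors: Set[str] = field(default_factory=set)
    administrators: Set[str] = field(default_factory=set)  # may add/remove editors
    immutable: bool = False   # True if the file can no longer be updated or deleted
    comments: List[Comment] = field(default_factory=list)


# Placeholder values, chosen only for illustration.
record = FileRecord(
    file_id="example/report",
    content_address="<content address>",
    storage_system="ipfs",
    last_uploader="0xPublisher",
    last_update_time=1485820800,
    editors={"0xPublisher"},
    administrators={"0xPublisher"},
)
```

In this sketch the record starts mutable and without comments; both defaults are choices of the example, not documented behaviour.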
    Information in EthDrive is secured because only authorised personnel can
mutate particular information. Information like the editor and time of the last
update is assigned by the smart contract. Any invalid attempt will be denied by
the blockchain network. Read access control is enabled by encrypting the
content address and/or the file content itself. By the nature of both blockchain
and content-addressable distributed file systems, high availability is easily
achievable while data integrity is guaranteed. In addition, EthDrive provides data
provenance driven by the distributed consensus, which is difficult to achieve for
conventional cloud storage without a trusted party [2]. EthDrive addresses users'
main concerns when using cloud storage [4]: security, control, vendor lock-in,
configurability, and speed to activate new services and expand capacity.
    Related work that aims to address those problems includes Permacoin [6],
Sia⁷, and Storj⁸. However, they require the currency of the respective
blockchain to be valuable, while EthDrive can leverage multiple blockchains with
different mechanisms and benefit from the pre-existing cryptocurrency ecosystem.

2     Background
Conventional Cloud Storage In the scenario of sharing files using conventional
cloud storage, the cloud provider stores the data uploaded by the file publisher.
In this model, it is not transparent to the users how the cloud provider
deals with those files. File integrity and availability often do not come with
their services [3]. Even if file integrity is included in the Service Level
Agreement (SLA), file consumers are unable to verify it without extra effort. For
provenance, we can only trust the cloud provider, which can be unfavourable [2].

⁷ https://sia.tech/
⁸ https://storj.io/

Fig. 1. Overview of EthDrive
[Figure: the EthDrive client connects to the blockchain network and the P2P file
system network via ethdrive start. File publisher commands: init, start, upload,
comment. File consumer commands: init, get, start, query, download, comment.]

Peer-to-Peer Data Storage and File Sharing Peer-to-peer technology has
been used for distributed data storage and file sharing. Such a system allows
users to access data that is stored on other computers connected to the same
peer-to-peer network. A centralised server is not required. Existing products
include BitTorrent⁹ and IPFS (InterPlanetary File System)¹⁰. These
peer-to-peer techniques use different mechanisms to share data with peers and to
replicate data across nodes. IPFS is an open source, content-addressable, globally
(peer-to-peer) distributed file system for sharing a large volume of data with high
throughput. IPFS has several limitations, including 1) the originality of files
cannot be verified, and 2) limited mutability.
Blockchain      The blockchain data structure is a time-stamped list of blocks.
Blocks are containers that aggregate transactions. The blockchain provides an
immutable data storage where existing transactions cannot be updated or deleted.
Cryptography and digital signatures are used to prove identity and
authenticity. For a basic description and technical details of blockchain, please
refer to [8]. Smart contracts are programs running on a blockchain. All interactions
with smart contracts are recorded immutably, and verified and accepted by the
whole blockchain network. Ethereum¹¹ was the most widely used blockchain that
supported general-purpose (Turing-complete) smart contracts at the time this
research was conducted.
⁹ http://www.bittorrent.com/
¹⁰ https://ipfs.io/
¹¹ https://www.ethereum.org/




3   EthDrive
By using blockchain as a reliable distributed database with content-addressable
peer-to-peer storage, EthDrive guarantees that the files to be downloaded from
storage providers are intact. EthDrive is currently based on Ethereum and IPFS.
However, it is not coupled to these specific implementations. EthDrive is
suited to scenarios in which there is no trusted third party or the file content
must be identical to the one uploaded by the indicated file uploader. Fig. 1 gives
an overview of EthDrive with all the available commands. The file publisher and
the file consumer need to configure the environment via ethdrive init and start
the daemon process of EthDrive via ethdrive start before other operations.
    Users can upload and download files via ethdrive upload and ethdrive
download. A user can use ethdrive upload to upload a file, then announce the
file ID to allow others to locate the file. A user who knows the IPFS address of
a file can also share it without uploading it, via ethdrive upload
with --not_provide as argument.
    Every file is retrieved via ethdrive download based on a customised file ID,
which is assigned by the file publisher who uploads the first version of the file.
A file ID is logical and human-readable. Conceptually, a file ID is associated with
information including the information publisher, a purpose, and the data related
to that purpose. Different namespaces can be used by deploying other smart
contracts with the same interface.
    Read access control is implemented by uploading an encrypted file content
address and/or an encrypted file content using symmetric encryption. Only
personnel having the correct key can download and/or decrypt the file content.
Write access control is enforced by the smart contract, which implements
permission control based on the blockchain account addresses. When a user calls
the smart contract to modify the file record for a particular file ID, the smart
contract checks whether the user has the permission based on the blockchain
account public key and the corresponding signature.
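
The write access check can be illustrated with a toy model. The Python sketch below is not the EthDrive contract: FileRegistry, PermissionDenied, and the rule that the first registrant becomes the initial editor and administrator are assumptions made for illustration. The sender argument stands in for the blockchain account that the network has already authenticated through its signature.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Set


class PermissionDenied(Exception):
    """Raised when a caller lacks the permission for an operation."""


@dataclass
class Record:
    content_address: str
    last_uploader: str
    last_update_time: float
    editors: Set[str] = field(default_factory=set)
    administrators: Set[str] = field(default_factory=set)
    immutable: bool = False


class FileRegistry:
    """Toy stand-in for the smart contract that holds the file records."""

    def __init__(self) -> None:
        self.records: Dict[str, Record] = {}

    def register(self, sender: str, file_id: str, content_address: str) -> None:
        if file_id in self.records:
            raise PermissionDenied("file ID is already registered")
        # Assumption: the first uploader becomes the initial editor and administrator.
        self.records[file_id] = Record(
            content_address, sender, time.time(), {sender}, {sender}
        )

    def update(self, sender: str, file_id: str, content_address: str) -> None:
        rec = self.records[file_id]
        if rec.immutable or sender not in rec.editors:
            raise PermissionDenied("sender may not update this record")
        rec.content_address = content_address
        # Uploader and time are assigned here, not supplied by the caller.
        rec.last_uploader = sender
        rec.last_update_time = time.time()

    def add_editor(self, sender: str, file_id: str, editor: str) -> None:
        rec = self.records[file_id]
        if sender not in rec.administrators:
            raise PermissionDenied("only administrators can add editors")
        rec.editors.add(editor)
```

With this model, an update from an account outside the editor set raises PermissionDenied until an administrator adds that account via add_editor.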
    Users can add comments to and get comments from a file record via ethdrive
comment if the file publisher has enabled this feature. Every comment has three
pieces of information associated with it: the file ID it comments on, the comment
author, and the comment content. Since all comments are stored in the blockchain,
the provenance of the comments is assured in the same way as that of file records.
The comment mechanism enables flexible interactions between the file author and the
comment author. For example, suppose a group of people uses EthDrive to publish an
article with comments enabled, and reviewers are invited to review this
article. While everyone can post comments on the file record, the authors
are only concerned with the reviewers' reviews (in the form of comments). By
checking the comment authors, those reviews can be identified, authenticated,
and confirmed to be intact without a centralised party.
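
The reviewer example amounts to filtering comments by author. A minimal Python sketch, assuming comments have already been read from the blockchain into (file ID, author, content) triples; the function name, account names, and sample data are hypothetical:

```python
from typing import Iterable, List, NamedTuple, Set


class Comment(NamedTuple):
    file_id: str   # file ID the comment refers to
    author: str    # blockchain account of the comment author
    content: str


def reviews_for(comments: Iterable[Comment], file_id: str,
                reviewers: Set[str]) -> List[Comment]:
    """Keep only the comments on file_id written by the invited reviewers."""
    return [c for c in comments if c.file_id == file_id and c.author in reviewers]


comments = [
    Comment("article-1", "0xReviewerA", "Section 3 needs more detail."),
    Comment("article-1", "0xStranger", "Nice article."),
    Comment("article-2", "0xReviewerA", "Comment on another file."),
    Comment("article-1", "0xReviewerB", "Accept with minor changes."),
]
reviews = reviews_for(comments, "article-1", {"0xReviewerA", "0xReviewerB"})
```

Because the comment author is recorded by the blockchain rather than typed in by the commenter, a filter on the author field is enough to separate the reviews from other comments.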
    Any node in the network can become a provider of a certain file via ethdrive
provide. Also, users who download the data become temporary providers.
The Distributed Hash Table (DHT) of IPFS is able to connect them when a
correct data content address is given.




3.1     Properties

Data Integrity Data integrity is achieved because 1) the content address of a
file will change when the file is modified and 2) the content address is stored on
the blockchain and can only be updated by authorised users.
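
Point 1) can be demonstrated with any cryptographic hash. The Python sketch below uses a plain SHA-256 hex digest as a stand-in for a content address; this is an assumption for illustration only, since IPFS derives its addresses differently (it splits files into blocks and encodes the hashes in its own format).

```python
import hashlib


def content_address(data: bytes) -> str:
    """Stand-in content address: the SHA-256 digest of the content."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, recorded_address: str) -> bool:
    """Check downloaded data against the address recorded on the blockchain."""
    return content_address(data) == recorded_address


original = b"sensor reading: 21.5"
recorded = content_address(original)   # what the file record would store

tampered = b"sensor reading: 99.9"
print(verify(original, recorded))   # the untouched content matches
print(verify(tampered, recorded))   # modified content yields a different address
```

Since the recorded address can only be changed by authorised users (point 2), a consumer who recomputes the address of the downloaded content detects any modification made by a storage provider.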
Data Availability Both blockchain and content-addressable distributed file
systems are based on peer-to-peer networks. Users of EthDrive are in the
Ethereum blockchain network and the IPFS DHT network at the same time. There
are two types of file provider in the IPFS network: permanent providers
and temporary providers. Permanent providers are nodes that provide the files.
They might be the file publishers or storage providers that have contracts
with the file publishers to help them serve the files. Temporary providers are the
ones that have downloaded the file. They will provide the file until the cache is
cleaned. In addition, there are nodes providing other files which share some common
blocks with the file. Such nodes are also providers of parts of the file.
Traceability The editor and the time of every update are recorded in the smart
contract and accepted by the whole blockchain network. Thus, the transactions
recorded on the blockchain provide the provenance of the file record. Using the
provenance, file consumers are able to confirm that the content is the one that
originally comes from the intended file publisher.
Customisability EthDrive is not coupled with the currently implemented smart
contract. It can work with any smart contract that implements the defined
interfaces. This feature allows different logic to be used, making EthDrive highly
customisable to fit different situations without modifying the client
program.

3.2     Limitations

Limitations of blockchain and IPFS apply to EthDrive. For example, the Ethereum
blockchain needs to be synchronised before using EthDrive. The database
grows without bound, so synchronisation can be time-consuming and the
database might occupy a large space. This can be addressed by deploying the light
client protocol version of Ethereum, which was under development at the time
this research was conducted. The download speed of IPFS can be slow
at times because the DHT routing of IPFS was still immature at the time this
research was conducted. However, the well-known 51% attack¹² on the blockchain
will only affect the data availability property of EthDrive while other main
properties are intact.

4      IoT Use Case
This section gives a detailed example to show how EthDrive can improve the
quality of data storage in the context of the Internet of Things.
¹² The general term refers to a situation where malicious miners have more mining
   power than honest miners do.




Fig. 2. IoT Data Storing Architecture
[Figure panels: a) IoT using conventional cloud, showing IoT gateways and a cloud;
b) IoT using EthDrive, showing IoT gateways, a data provider, a data consumer,
the blockchain network, and the P2P file system network.]


    Fig. 2 a) shows IoT using conventional cloud storage, which includes sensors,
IoT gateways, and a cloud storage. Sensors are used to collect data. IoT gateways
are used to aggregate sensor data, translate sensor protocols, and process
sensor data before the data is sent to the cloud storage. We assume IoT gateways
do not have enough storage for storing the data generated. Cloud storage providers
are responsible for storing and distributing the data generated from the sensors.
    Fig. 2 b) gives a conceptual architecture of IoT using EthDrive. Nodes of
the peer-to-peer network are made up of IoT gateways, storage providers, and
data consumers. EthDrive provides the provenance of the data and guaranteed
data integrity to increase confidence in the data from the data consumers'
perspective. The following shows the steps of how EthDrive is used.
Registration on the network A storage provider registers itself to the network
by uploading (ethdrive upload) an empty file named IOT_IDS. The file is
used to store the IDs of all the IoT gateways whose files (generated and uploaded
using EthDrive) the storage provider decides to provide. The file ID
represents the storage provider ID on the network. An IoT gateway registers itself
to the network by uploading an empty file named DATA_IDS. The file is used to store
the file IDs of all the files the IoT gateway generates and uploads using EthDrive.
The file ID represents the IoT gateway ID on the network.
Registration of an IoT gateway to a storage provider An IoT gateway registers
itself to a storage provider by adding a comment (ethdrive comment) on the file
record whose file ID is the storage provider ID. The storage provider periodically
checks if there is any new comment. If there is, the storage provider then decides
whether to accept the request. If it accepts, the IoT gateway ID will be added
to the file IOT_IDS. The modified file IOT_IDS is then uploaded (ethdrive
upload) to replace the old file.

Data uploading      An IoT gateway packs data collected from the sensors into a
file and uploads (ethdrive upload) it using EthDrive with a file ID. The file
ID is added to the file DATA_IDS, then DATA_IDS is uploaded to replace
the old file. The storage provider periodically checks all the registered IoT
gateway IDs to see whether they have uploaded any new file. If they have, the
storage provider will call ethdrive provide to become a provider of the newly
uploaded files.
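
The registration and data uploading steps above can be sketched as a small in-memory simulation. Everything in the following Python code is an illustrative assumption: the Network class replaces the blockchain and IPFS, the class and method names are hypothetical, comments are reduced to the gateway ID they carry, and the storage provider accepts every request. Real deployments would invoke the ethdrive commands named in the text.

```python
from typing import Dict, List, Set


class Network:
    """In-memory stand-in for file records and their comments."""

    def __init__(self) -> None:
        self.files: Dict[str, List[str]] = {}     # file ID -> lines of latest version
        self.comments: Dict[str, List[str]] = {}  # file ID -> comment contents

    def upload(self, file_id: str, lines: List[str]) -> None:
        self.files[file_id] = list(lines)         # replaces the old version

    def comment(self, file_id: str, content: str) -> None:
        self.comments.setdefault(file_id, []).append(content)


class StorageProvider:
    def __init__(self, net: Network, provider_id: str) -> None:
        self.net, self.id = net, provider_id
        self.providing: Set[str] = set()
        net.upload(self.id, [])                   # the empty IOT_IDS file

    def process_requests(self) -> None:
        accepted = list(self.net.files[self.id])
        for gateway_id in self.net.comments.get(self.id, []):
            if gateway_id not in accepted:        # this sketch accepts everyone
                accepted.append(gateway_id)
        self.net.upload(self.id, accepted)        # re-upload modified IOT_IDS

    def poll(self) -> None:
        for gateway_id in self.net.files[self.id]:
            for data_id in self.net.files[gateway_id]:
                self.providing.add(data_id)       # stands in for ethdrive provide


class Gateway:
    def __init__(self, net: Network, gateway_id: str) -> None:
        self.net, self.id = net, gateway_id
        net.upload(self.id, [])                   # the empty DATA_IDS file

    def register_with(self, provider_id: str) -> None:
        self.net.comment(provider_id, self.id)    # request sent as a comment

    def upload_data(self, data_id: str, readings: List[str]) -> None:
        self.net.upload(data_id, readings)
        self.net.upload(self.id, self.net.files[self.id] + [data_id])


net = Network()
provider = StorageProvider(net, "provider-1")
gateway = Gateway(net, "gateway-1")
gateway.register_with("provider-1")
provider.process_requests()
gateway.upload_data("gateway-1/batch-0001", ["t=0 temp=21.5"])
provider.poll()
```

After these calls, the provider's IOT_IDS file lists gateway-1, the gateway's DATA_IDS file lists the uploaded batch, and the provider has picked up that batch for providing.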




Access data      EthDrive uses a file ID to locate the data content address in
the blockchain. If it is encrypted, the data consumer supplies the correct key to
EthDrive to get (ethdrive get) the data content address. Then, EthDrive downloads
the data to the data consumer's local storage. If the data has been corrupted,
it will not be available, since the IPFS component of EthDrive will not be
able to locate it by the data content address stored in the blockchain. The time
of upload and the data publisher can be verified by retrieving the file record
information via ethdrive query.

5      Performance Data
Uploading files via ethdrive upload only creates a transaction for storing the
file content address on the Ethereum blockchain and uploads the file content
to the local IPFS node. Thus, there is no network latency for ethdrive upload.
    We conducted an experiment to evaluate the performance of downloading
files using ethdrive download. In this experiment, we compared the download
performance between EthDrive and a conventional cloud storage, AWS S3¹³ in
the US standard region. We deployed four EthDrive nodes in four locations: Hong
Kong, Tokyo (Japan), Sydney (Australia), and Pittsburgh (USA). The locations
were selected so that they are geographically distinct to simulate a global
peer-to-peer network. The test files were generated by a random data generator. We
queried the IPFS network and confirmed that no block in the test files was shared
with any pre-existing file on the network. All the test files were provided by three
providers. An Ethereum private blockchain was used.

                          Fig. 3. The Experiment Result




    Fig. 3 shows the result of this experiment. All the results shown in the figure
are the average values of ten identical tests. The Ethereum lookup time of
EthDrive over all the tests takes 0.038 seconds on average with a standard
deviation of 0.000166. The Ethereum lookup time was almost constant in all tests,
which is expected because the retrieval of the file content address is done on
the local blockchain data. Most of the time EthDrive spent was for downloading
the files from the IPFS network. The yellow column (Ethereum Lookup Time)
in Fig. 3 is too small to be visible. In this experiment, EthDrive is faster
than AWS S3 when the file size is ≤ 20 MB and slower when the file size is >
30 MB. This shows that IPFS was not handling large files efficiently at the time
this experiment was conducted. This result suggests that when the file size
is small, the additional features provided by EthDrive do not come with a
performance penalty.

¹³ https://aws.amazon.com/s3/

6   Conclusion and Future Work
By utilizing blockchain and content-addressable distributed storage, the
demonstrated EthDrive fulfils the basic requirements for a reliable cloud storage:
file integrity, reliable permission control, and high availability.
In addition, EthDrive provides custom file IDs as a means to locate the files
to facilitate easy access, sharing, and management, and a comment mechanism to
increase its versatility. For our future work, the following features/functions can
be implemented on top of the current version: 1) An extension which allows users
to pay particular parties to have them provide certain files to ensure the
minimum availability of those files. The payment will be made automatically while
they are providing the file and terminated when they are not. 2) After integrating
with other content-addressable distributed file systems, files can be downloaded
from all the networks concurrently to boost the performance. 3) Mounting files
filtered by some conditions to the file system via Filesystem in Userspace (FUSE)
to facilitate convenient use. We also plan to evaluate EthDrive more
systematically using additional evaluation criteria.

References
1. H. Abu-Libdeh, L. Princehouse, and H. Weatherspoon. RACS: a case for cloud
   storage diversity. In Proceedings of the 1st ACM Symposium on Cloud Computing,
   pages 229-240. ACM, 2010.
2. M. R. Asghar, M. Ion, G. Russello, and B. Crispo. Securing Data Provenance in
   the Cloud, pages 145-160. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
3. I. Ion, N. Sachdeva, P. Kumaraguru, and S. Čapkun. Home is safer than the cloud!:
   privacy concerns for consumer cloud storage. In Proceedings of the Seventh
   Symposium on Usable Privacy and Security, page 13. ACM, 2011.
4. J. Ju, J. Wu, J. Fu, Z. Lin, and J. Zhang. A survey on cloud storage. JCP,
   6(8):1764-1771, 2011.
5. B. Mao, S. Wu, and H. Jiang. Improving storage availability in cloud-of-clouds
   with hybrid redundant data distribution. In 2015 IEEE International Parallel and
   Distributed Processing Symposium, pages 633-642, May 2015.
6. A. Miller, A. Juels, E. Shi, B. Parno, and J. Katz. Permacoin: Repurposing bitcoin
   work for data preservation. In Security and Privacy (SP), 2014 IEEE Symposium
   on, pages 475-490. IEEE, 2014.
7. S. Omohundro. Cryptocurrencies, smart contracts, and artificial intelligence. AI
   Matters, 1(2), Dec. 2014.
8. M. Swan. Blockchain: Blueprint for a New Economy. O'Reilly, US, 2015.



