Secure Data Processing at Scale

                                                         Kajetan Maliszewski
                                                   supervised by Prof. Volker Markl
                                                     Technische Universitat Berlin
                                                      maliszewski@tu-berlin.de


ABSTRACT                                                                     them to not trade privacy. As most cloud solutions do not
Although the cloud is today a de-facto standard for scal-                    treat privacy as a first-class citizen, users end up working on
able data processing, there are still many applications that                 their premises sacrificing scalability and efficiency. This, for
cannot make use of the cloud due to data or computa-                         example, is the case of most applications in the healthcare
tion privacy. Sensitive data, such as in the health do-                      domain, which use in-house solutions and infrastructure [8].
main; and computations, such as core-business AI pipelines,                     Even though, hybrid-cloud has recently appeared as a pos-
grew into valuable assets that made secure data process-                     sible solution to this problem [17, 18], sensitive data and
ing a hot topic in industry and academia. On one hand,                       computations still cannot be moved from the private cloud.
the existing data processing systems prioritize performance                  This is because existing data processing systems lack fea-
and, to a certain level, trade users’ privacy. On the other                  tures for preserving the privacy of data and computations.
hand, privacy-preserving data processing systems sacrifice                   Although few systems work on encrypted data [14, 15], most
performance. In this PhD thesis, we envision a fully secure                  cloud-based data processing systems, such as Spark and
general-purpose data processing system for the cloud. Over-                  Flink, expose data and computations at the hardware level.
all, we aim at devising: (i) algorithms that are adequate to                 The research community proposed using trusted execution
work with very limited memory, such as the one exposed                       environments (TEEs) [12] to provide solutions to this prob-
by trusted execution environments; (ii) scalable state man-                  lem [7, 9, 11, 16, 19]. Other works have also used oblivious
agement techniques; (iii) oblivious data-access algorithms;                  algorithms to provide stronger security for TEEs by hiding
and (iv) privacy-preserving query optimizations techniques                   data access patterns [7, 19]. Nevertheless, all these solu-
to speed up query execution.                                                 tions suffer from several of the following problems: they
                                                                             (i) lack basic performance optimizations; (ii) do not scale
                                                                             out; (iii) cannot support stateful operators nor large state
1.    INTRODUCTION                                                           information; and (v) are ad-hoc to specific cases.
   Processing data on the cloud has become omnipresent in                       Therefore, despite all these efforts, we are still missing a
our days [2]. For example, services, such as Amazon AWS                      holistic solution that could be a panacea to all the afore-
and Microsoft Azure, have made trivial for companies, re-                    mentioned concerns. The Holy Grail would be to replicate
searchers, and organizations (users for short) to set up and                 the success of general-purpose distributed data processing
maintain compute nodes. The cloud has given unprece-                         systems for secure, scalable cloud data processing. Users
dented power to users: they can now run applications and                     should focus on the logic of their applications while the sys-
analytics before they were not able to run. For instance, a                  tem should take care of running and scaling out such applica-
small company can easily offer scalable data analytics with-                 tions efficiently and without any data/computation leakage.
out owning its data infrastructure [1].                                      To the best of our knowledge, there is no general-purpose
   However, there are still many applications from different                 system that provides support for fully secure and scalable
domains that cannot fully benefit from the cloud. Among                      cloud data processing.
these, we mainly find users working with sensitive data,                        Building such a system is particularly challenging for
e. g., on medical or transactional data, and users per-                      many reasons. First, users must be able to easily define pri-
forming sensitive computations, e. g., machine/deep learn-                   vacy constraints over their data and computations. Second,
ing pipelines defining the core business of a company. These                 we have to revisit data processing algorithms and state man-
users typically have to classify their data or computations                  agement techniques to work within secure environments.
and hence are subject to strict compliance rules that force                  Third, the system must ensure data and computation pri-
                                                                             vacy on public compute nodes without sacrificing perfor-
                                                                             mance. Fourth, it is not clear how the system optimizes
                                                                             queries in the cloud when privacy is a first-class citizen.
                                                                                In this thesis, we plan to tackle the above research chal-
                                                                             lenges and lay down the foundations of the foreseen general-
                                                                             purpose system. We plan to proceed as follows: we will first
                                                                             build a single node secure and efficient data processing en-
Proceedings of the VLDB 2020 PhD Workshop, August 31st, 2020. Tokyo,
                                                                             gine; we will follow with making our data processing engine
Japan. Copyright (C) 2020 for this paper by its authors. Copying permitted   distributed and scalable, by considering public and trusted
for private and academic purposes.
nodes; we will focus on devising a privacy-preserving query       traffic bottleneck). It also does not protect against access-
optimizer. In summary, we plan to make the following major        pattern attacks. TrustedDB [5] is a secure database built
contributions:                                                    on a cryptographic coprocessor, an older trusted hardware
     1. We will devise an efficient general-purpose data pro-     architecture. It stores large state externally and accesses
        cessing engine for TEEs. In particular, we will propose   it using a Paging Module, which exposes the access pat-
        new data processing algorithms that are adequate to       terns. EnclaveDB [11] is a database utilizing SGX that only
        work with very limited memory (provided by TEEs           handles state up to the size of the enclave memory (approxi-
        environments).                                            mately 90 MB). The authors assume that the future releases
                                                                  of SGX would support much larger memory, however, up
     2. We will extend our data processing engine to support      until the current release it has not happened.
        stateful operators. We will particularly provide both        The problem of tiny memory has been previously ad-
        oblivious data-access algorithms and support for large    dressed in small footprint databases [6, 10]. They propose
        state information in TEEs environments.                   lightweight solutions, however, drastically limiting function-
                                                                  ality for their very specific use cases, i. e., handheld comput-
     3. We will propose bidirectional data anonymization al-
                                                                  ers, and smartcards.
        gorithms as well as data processing algorithms being
        able to work over encrypted data.                         Data Access Privacy. ObliDB [7] is an oblivious database
                                                                  core engine for general workloads. It hides access patterns
     4. We will then devise different privacy-preserving query    using oblivious query processing algorithms that require a
        optimizations techniques to speed up query execution.     full table scan for each query. However, it runs only on a
   In the remainder of this paper, we first define the prob-      single node in an SGX environment. Opaque [19] executes
lem we plan to tackle in this thesis in Section 2. We present     encrypted Spark jobs using oblivious access. It proposes a
related work in Section 3. In Section 4, we depict our envi-      query optimizer to mitigate the cost of obliviousness, how-
sioned solution. We follow, in Section 5, with different open     ever, it still reaches performance degradation of up to 46x.
challenges we must tackle to make our envisioned solution a       Query Optimization.            The existing systems utilizing
reality. Lastly, we conclude this paper in Section 6.             hybrid-cloud [17, 18] perform the optimizations based on
                                                                  manual tagging the data by the users with its sensitivity.
2.     PROBLEM STATEMENT                                          Later, only the insensitive data is processed in the public
   Efficient and fully secure data processing on the cloud is     cloud. This is cumbersome for the users and harmful for
currently not possible at the terabyte scale. Guaranteeing        workloads consisting mainly of sensitive data. In contrast
robustness and consistent privacy level requires (i) usage of     to these works on hybrid-cloud, we aim at sending sensitive
novel hardware technologies for low-level security, (ii) re-      data to the public cloud and omit the tagging process by
designing existing approaches for new environments, and           leveraging secure hardware.
(iii) efficient secure query processing engine. Data and com-
putations can only be fully protected using technologies such
as TEEs, combined with oblivious data access. Most of the         4.   OUR VISION
existing execution approaches cannot be easily mapped to
the new runtimes due to heavy limitations that TEEs en-              To overcome the problem stated in Section 2, we envision a
force on the users. Data processing algorithms have to be         system comprising a master node and three types of compute
redesigned having these limitations in mind. Most impor-          nodes: (i) private compute nodes (a fully trusted worker),
tantly, compute resources have to be wisely fully utilized        (ii) trusted/secure compute nodes (guaranteed with certifi-
to guarantee high query execution efficiency. Moreover,           cates of trust), (iii) public compute nodes (a fully untrusted
worker nodes need to be classified by the level of security       machine). We believe that these abstractions ideally fulfill
they require; private compute nodes are considered secure,        the performance needs while maintaining strong privacy.
trusted/secure compute nodes can prove their security but            A user submits a query to the master node. In turn, the
require data anonymization for privacy, and public compute        master node parses the query and optimizes it by exploiting
nodes need oblivious processing in a TEE.                         the knowledge about the topology and available resources. It
   Therefore, the problem, and main challenge, resides in         then compiles, executes, and monitors the query. Note that
how to enable efficient and truly secure data processing jobs     it is the compute nodes that carry out the actual execution.
to hybrid compute environments.                                      Figure 1 depicts the execution of a query over the three
                                                                  types of machines, namely private, trusted/secure, and pub-
                                                                  lic compute nodes. Each node has a rigorous way of process-
3.     RELATED WORK                                               ing the query enforced by the query execution plan. Pri-
Efficient Execution & State Management. Sanctu-                   vate compute nodes are allowed to process the input data in
ary [16] is a distributed streaming system. The authors           clear, i. e., neither encrypted nor anonymized (green arrow).
present a set of algorithms to manage large state, but they       Trusted/secure compute nodes are considered trustworthy.
blindly spill to disk the state information. Additionally, the    Hence, they can process data as a private node but might
state is not managed obliviously, hence, the longer the job       also be forced to process anonymized data depending on its
runs, the more information leaks out. SecureStreams [9]           trustworthy degree (orange arrow). Public compute nodes
is a lightweight streaming platform running in SGX en-            are simply considered insecure. They thus receive enclave-
claves. The jobs are build using a set of simple operators        encrypted data and process it inside a TEE (red arrow).
but the system lacks optimizations (unnecessary encryp-           Executing a query in such environments is far from being
tion/decryption between each operator) and does not scale-        trivial as data might be transferred from one kind of com-
out well (inter-operator communication quickly becomes a          pute nodes to another. For example, Figure 1 illustrates
Figure 1: Overall architecture of our envisioned system: green arrows represent traffic in plain text, orange
arrows represent anonymized data, and red arrows represent data encrypted with the enclave key.


such data transfers with different arrow colors: while pass-     5.2    State Management
ing from orange to green means data de-anonymization, red           The state is an essential element of an operator. During
to green stands for decryption with the enclave key. At the      execution, it is used for storing metadata and intermedi-
end of the query, the data is aggregated and sent to output      ate results, e. g., a rolling aggregation while scanning a ta-
as defined by the query.                                         ble. Handling large state management efficiently for enclave-
                                                                 enabled operators is challenging because of the extremely
5.    RESEARCH CHALLENGES                                        small main memory capacity in TEEs (see Section 5.1).
  Building a system as described in Section 4 comes with            Existing systems store large states in the memory out-
several research challenges, mainly around data access pri-      side of the enclave[5, 16]. However, constant paging, and
vacy, efficient query execution, state management, and query     thus, data encryption and decryption, imposes great per-
optimization. We elaborate on each of these in the following.    formance deterioration. For example, TrustedDB [5] uses a
                                                                 custom-built Paging Module that stores all pages outside of
5.1   Efficient query execution                                  the Secure Coprocessor. The pages are pulled on-demand
   Efficiently executing a query in TEEs is quite challenging    as needed by the query processing engine. Thoma et al.
because of the extremely small main memory capacity in           in [16] propose stateful operators for stream processing us-
TEEs, expensive CPU instruction set, and no system calls.        ing SGX’s built-in paging mechanism. Yet, both systems are
For example, Intel’s proprietary TEE technology (SGX) de-        not sufficiently optimized for secure hardware. For example,
fines an enclave, a private and highly protected region in       in [16], a hash-join operator blindly spills the hash table to
memory that cannot be accessed from outside of its process.      the outside memory, leading to expensive calls outside the
It places the enclave code and data in a special memory          enclave whenever it looks for a tuple.
area of 128 MB. Yet, excluding space for the SGX metadata,          Thus, new state management techniques are required for
there is approximately only 90 MB left for the application.      settings where the main memory is drastically limited. We
   As a result, the design of state-of-the-art data processing   will investigate new data structures and data indexes that
operations (e. g., a join operator) cannot simply be mapped      allow for state management outside the enclave while at the
to enclave-enabled versions. For example, consider the case      same time reducing the unnecessary outside calls.
of a hash-join operator. In the build phase, this operator
takes the smaller table and builds the hash table for the        5.3    Data Access Privacy
selected key. Once the table reaches 90 MB in size, it will         Although TEEs (such as SGX) provide secure data pro-
start spilling the records to the memory outside the enclave     cessing via hardware, it is still possible to have data leak-
in an encrypted form. These data spilling operations cause       age by understanding the data access patterns done by the
expensive calls to the CPU due to the context-switching          operators inside the enclave. There have thus been many
instructions and costly encryption.                              proposals on how to achieve data access privacy in systems
   Therefore, efficiently executing queries in TEEs requires     designed for specific use-cases, e. g., homomorphic encryp-
a radical change in the design of data processing operators.     tion [14] and oblivious algorithms [7, 19].
We will investigate new techniques for cache management             However, it is a perpetuating problem how to glue these
inside the enclaves and optimizations on CPU instruction         proposals together to provide a general-purpose data pro-
set specifically for relational algebra. To speed up query       cessing engine with oblivious data access and without hurt-
execution even further, we will investigate the use of query     ing performance. For instance, Opaque [19] provides obliv-
compilation techniques to generate highly optimized code         ious data processing on SGX, but comes with an overhead
for TEEs. Additionally, we will examine parallel enclaves        of up to 46x! Similarly, ObliDB [7] comes with a huge over-
execution on multi-core CPUs. We will design data process-       head as it performs a full table scan for each query to achieve
ing operators relying on these new techniques.                   obliviousness.
   We will explore different ways of reducing such overhead       8.   REFERENCES
incurred by current oblivious data access algorithms. Par-         [1] AWS Startup Stories. https://aws.amazon.com
ticularly, we plan to investigate how to set a lower-bound on          /campaigns/aws-startups-stories/.
the number of non-relevant tuples that are necessary to hide       [2] Microsoft’s Growth ReAzuring Under Nadella.
access patterns. Such a lower-bound guarantee is required              https://markets.businessinsider.com
to not compromise privacy while not harming performance.               /news/stocks/microsoft-s-growth-reazuring-under-
Another research direction we will explore is to store inside          nadella-1028372914.
the enclave metadata about table values (similar to Oracle’s
                                                                   [3] Oracle Database Concepts. https://docs.oracle.com.
Zone Maps [3]). Having such knowledge ahead of a query
                                                                   [4] P. Antonopoulos, A. Arasu, K. Eguro, J. Hammer,
can even prevent scanning a table at all.
                                                                       R. Kaushik, D. Kossmann, R. Ramamurthy, and
5.4    Privacy-Preserving Query Optimization                           J. Szymaszek. Pushing the limits of encrypted
                                                                       databases with secure hardware. arXiv preprint
   Intel SGX is notorious for its performance degradation
                                                                       arXiv:1809.02631, 2018.
and vulnerability to some attacks [4]. End-to-end encryp-
tion and data de/anonymization are known to be compu-              [5] S. Bajaj and R. Sion. TrustedDB: A Trusted
tationally intensive. Inter-cloud data transfers have been             Hardware Based Database with Privacy and Data
identified as a bottleneck in cloud computing [18]. These              Confidentiality. In SIGMOD, 2011.
are some of the aspects the query optimizer will consider          [6] C. Bobineau, L. Bouganim, P. Pucheral, and
when deciding which data is to be executed on which nodes.             P. Valduriez. PicoDBMS: Scaling down database
   Existing systems reduce the problem space by reducing               techniques for the smartcard. In VLDB, 2000.
their functionality [7, 13]. However, while VC3 [13] is not        [7] S. Eskandarian and M. Zaharia. ObliDB: oblivious
using oblivious access and thus it is vulnerable to data access        query processing for secure databases. PVLDB, 2019.
pattern attacks, ObliDB [7] runs only in an SGX-enabled            [8] L. Griebel, H.-U. Prokosch, F. Köpcke,
environment. Essentially, all these systems are designed for           D. Toddenroth, J. Christoph, I. Leb, I. Engel, and
homogeneous environments, i. e., they assume all compute               M. Sedlmayr. A scoping review of cloud computing in
nodes in the system provide a TEE environment. This is                 healthcare. BMC Med. Inf. & Decision Making, 2015.
not the case in our envisioned system where multifarious           [9] A. Havet, R. Pires, P. Felber, M. Pasin, R. Rouvoy,
machines with different requirements co-habit together in              and V. Schiavoni. Securestreams: A reactive
the same system. Enforcing a unified policy across all of              middleware framework for secure data stream
them would end up in performance degradations.                         processing. In DEBS, 2017.
   To solve this challenge, we will study the performance         [10] J. S. Karlsson, A. Lal, C. Leung, and T. Pham. IBM
of connecting operators with different privacy constraints.            DB2 everyplace: A small footprint relational database
This will add a new dimension to the query optimization                system. In ICDE, 2001.
and will force us to rethink the optimization techniques          [11] C. Priebe, K. Vaswani, and M. Costa. Enclavedb: A
for logical and physical query plans. Moreover, we will in-            secure database using SGX. In S&P, 2018.
vestigate new techniques for data anonymization and de-           [12] M. Sabt, M. Achemlal, and A. Bouabdallah. Trusted
anonymization, data processing over encrypted data, and                Execution Environment: What It is, and What It is
query optimization for heterogeneous secure environments.              Not. In Trustcom/BigDataSE/ISPA, 2015.
                                                                  [13] F. Schuster, M. Costa, C. Fournet, C. Gkantsidis,
6.    CONCLUSION                                                       M. Peinado, G. Mainar-Ruiz, and M. Russinovich.
   We presented our plan towards a general-purpose secure              VC3: Trustworthy data analytics in the cloud using
and scalable data processing system for hybrid environ-                SGX. In S&P, 2015.
ments (i. e., composed of private, trusted, and public com-       [14] J. J. Stephen, S. Savvides, V. Sundaram, M. S.
pute nodes). We showed that each of the existing systems               Ardekani, and P. Eugster. STYX: stream processing
solves only a piece of the problem. State-of-the-art solu-             with trustworthy cloud-based execution. In SoCC,
tions force users to work with sensitive data or computations          2016.
only within a private environment, hence underutilizing the       [15] S. D. Tetali, M. Lesani, R. Majumdar, and
available computing resources. Moreover, some works trade              T. Millstein. MrCrypt: Static analysis for secure cloud
users’ privacy or drastically limit the use-case, e. g., to a          computations. In OOPSLA, 2013.
centralized system not suitable for very large datasets. We       [16] C. Thoma, A. J. Lee, and A. Labrinidis. Behind
showed that we are still missing a system that sees the bigger         enemy lines: Exploring trusted data stream processing
picture and solves the problem of truly secure data process-           on untrusted systems. In CODASPY, 2019.
ing at scale. We presented our envisioned system to solve         [17] X. Xu and X. Zhao. A framework for privacy-aware
such a problem and discussed the main research challenges              computing on hybrid clouds with mixed-sensitivity
that we must tackle to make our vision a reality.                      data. In HPCC/CSS/ICESS, 2015.
                                                                  [18] K. Zhang, X. Zhou, Y. Chen, X. Wang, and Y. Ruan.
7.    ACKNOWLEDGMENTS                                                  Sedic: privacy-aware data intensive computing on
  This work was funded by the German Ministry for Edu-                 hybrid clouds. In CCS, 2011.
cation and Research as BIFOLD - Berlin Institute for the          [19] W. Zheng, A. Dave, J. G. Beekman, R. A. Popa, J. E.
Foundations of Learning and Data (ref. 01IS18025A and ref.             Gonzalez, and I. Stoica. Opaque: An oblivious and
01IS18037A).                                                           encrypted distributed analytics platform. In NSDI,
                                                                       2017.