Secure Data Processing at Scale Kajetan Maliszewski supervised by Prof. Volker Markl Technische Universitat Berlin maliszewski@tu-berlin.de ABSTRACT them to not trade privacy. As most cloud solutions do not Although the cloud is today a de-facto standard for scal- treat privacy as a first-class citizen, users end up working on able data processing, there are still many applications that their premises sacrificing scalability and efficiency. This, for cannot make use of the cloud due to data or computa- example, is the case of most applications in the healthcare tion privacy. Sensitive data, such as in the health do- domain, which use in-house solutions and infrastructure [8]. main; and computations, such as core-business AI pipelines, Even though, hybrid-cloud has recently appeared as a pos- grew into valuable assets that made secure data process- sible solution to this problem [17, 18], sensitive data and ing a hot topic in industry and academia. On one hand, computations still cannot be moved from the private cloud. the existing data processing systems prioritize performance This is because existing data processing systems lack fea- and, to a certain level, trade users’ privacy. On the other tures for preserving the privacy of data and computations. hand, privacy-preserving data processing systems sacrifice Although few systems work on encrypted data [14, 15], most performance. In this PhD thesis, we envision a fully secure cloud-based data processing systems, such as Spark and general-purpose data processing system for the cloud. Over- Flink, expose data and computations at the hardware level. all, we aim at devising: (i) algorithms that are adequate to The research community proposed using trusted execution work with very limited memory, such as the one exposed environments (TEEs) [12] to provide solutions to this prob- by trusted execution environments; (ii) scalable state man- lem [7, 9, 11, 16, 19]. Other works have also used oblivious agement techniques; (iii) oblivious data-access algorithms; algorithms to provide stronger security for TEEs by hiding and (iv) privacy-preserving query optimizations techniques data access patterns [7, 19]. Nevertheless, all these solu- to speed up query execution. tions suffer from several of the following problems: they (i) lack basic performance optimizations; (ii) do not scale out; (iii) cannot support stateful operators nor large state 1. INTRODUCTION information; and (v) are ad-hoc to specific cases. Processing data on the cloud has become omnipresent in Therefore, despite all these efforts, we are still missing a our days [2]. For example, services, such as Amazon AWS holistic solution that could be a panacea to all the afore- and Microsoft Azure, have made trivial for companies, re- mentioned concerns. The Holy Grail would be to replicate searchers, and organizations (users for short) to set up and the success of general-purpose distributed data processing maintain compute nodes. The cloud has given unprece- systems for secure, scalable cloud data processing. Users dented power to users: they can now run applications and should focus on the logic of their applications while the sys- analytics before they were not able to run. For instance, a tem should take care of running and scaling out such applica- small company can easily offer scalable data analytics with- tions efficiently and without any data/computation leakage. out owning its data infrastructure [1]. To the best of our knowledge, there is no general-purpose However, there are still many applications from different system that provides support for fully secure and scalable domains that cannot fully benefit from the cloud. Among cloud data processing. these, we mainly find users working with sensitive data, Building such a system is particularly challenging for e. g., on medical or transactional data, and users per- many reasons. First, users must be able to easily define pri- forming sensitive computations, e. g., machine/deep learn- vacy constraints over their data and computations. Second, ing pipelines defining the core business of a company. These we have to revisit data processing algorithms and state man- users typically have to classify their data or computations agement techniques to work within secure environments. and hence are subject to strict compliance rules that force Third, the system must ensure data and computation pri- vacy on public compute nodes without sacrificing perfor- mance. Fourth, it is not clear how the system optimizes queries in the cloud when privacy is a first-class citizen. In this thesis, we plan to tackle the above research chal- lenges and lay down the foundations of the foreseen general- purpose system. We plan to proceed as follows: we will first build a single node secure and efficient data processing en- Proceedings of the VLDB 2020 PhD Workshop, August 31st, 2020. Tokyo, gine; we will follow with making our data processing engine Japan. Copyright (C) 2020 for this paper by its authors. Copying permitted distributed and scalable, by considering public and trusted for private and academic purposes. nodes; we will focus on devising a privacy-preserving query traffic bottleneck). It also does not protect against access- optimizer. In summary, we plan to make the following major pattern attacks. TrustedDB [5] is a secure database built contributions: on a cryptographic coprocessor, an older trusted hardware 1. We will devise an efficient general-purpose data pro- architecture. It stores large state externally and accesses cessing engine for TEEs. In particular, we will propose it using a Paging Module, which exposes the access pat- new data processing algorithms that are adequate to terns. EnclaveDB [11] is a database utilizing SGX that only work with very limited memory (provided by TEEs handles state up to the size of the enclave memory (approxi- environments). mately 90 MB). The authors assume that the future releases of SGX would support much larger memory, however, up 2. We will extend our data processing engine to support until the current release it has not happened. stateful operators. We will particularly provide both The problem of tiny memory has been previously ad- oblivious data-access algorithms and support for large dressed in small footprint databases [6, 10]. They propose state information in TEEs environments. lightweight solutions, however, drastically limiting function- ality for their very specific use cases, i. e., handheld comput- 3. We will propose bidirectional data anonymization al- ers, and smartcards. gorithms as well as data processing algorithms being able to work over encrypted data. Data Access Privacy. ObliDB [7] is an oblivious database core engine for general workloads. It hides access patterns 4. We will then devise different privacy-preserving query using oblivious query processing algorithms that require a optimizations techniques to speed up query execution. full table scan for each query. However, it runs only on a In the remainder of this paper, we first define the prob- single node in an SGX environment. Opaque [19] executes lem we plan to tackle in this thesis in Section 2. We present encrypted Spark jobs using oblivious access. It proposes a related work in Section 3. In Section 4, we depict our envi- query optimizer to mitigate the cost of obliviousness, how- sioned solution. We follow, in Section 5, with different open ever, it still reaches performance degradation of up to 46x. challenges we must tackle to make our envisioned solution a Query Optimization. The existing systems utilizing reality. Lastly, we conclude this paper in Section 6. hybrid-cloud [17, 18] perform the optimizations based on manual tagging the data by the users with its sensitivity. 2. PROBLEM STATEMENT Later, only the insensitive data is processed in the public Efficient and fully secure data processing on the cloud is cloud. This is cumbersome for the users and harmful for currently not possible at the terabyte scale. Guaranteeing workloads consisting mainly of sensitive data. In contrast robustness and consistent privacy level requires (i) usage of to these works on hybrid-cloud, we aim at sending sensitive novel hardware technologies for low-level security, (ii) re- data to the public cloud and omit the tagging process by designing existing approaches for new environments, and leveraging secure hardware. (iii) efficient secure query processing engine. Data and com- putations can only be fully protected using technologies such as TEEs, combined with oblivious data access. Most of the 4. OUR VISION existing execution approaches cannot be easily mapped to the new runtimes due to heavy limitations that TEEs en- To overcome the problem stated in Section 2, we envision a force on the users. Data processing algorithms have to be system comprising a master node and three types of compute redesigned having these limitations in mind. Most impor- nodes: (i) private compute nodes (a fully trusted worker), tantly, compute resources have to be wisely fully utilized (ii) trusted/secure compute nodes (guaranteed with certifi- to guarantee high query execution efficiency. Moreover, cates of trust), (iii) public compute nodes (a fully untrusted worker nodes need to be classified by the level of security machine). We believe that these abstractions ideally fulfill they require; private compute nodes are considered secure, the performance needs while maintaining strong privacy. trusted/secure compute nodes can prove their security but A user submits a query to the master node. In turn, the require data anonymization for privacy, and public compute master node parses the query and optimizes it by exploiting nodes need oblivious processing in a TEE. the knowledge about the topology and available resources. It Therefore, the problem, and main challenge, resides in then compiles, executes, and monitors the query. Note that how to enable efficient and truly secure data processing jobs it is the compute nodes that carry out the actual execution. to hybrid compute environments. Figure 1 depicts the execution of a query over the three types of machines, namely private, trusted/secure, and pub- lic compute nodes. Each node has a rigorous way of process- 3. RELATED WORK ing the query enforced by the query execution plan. Pri- Efficient Execution & State Management. Sanctu- vate compute nodes are allowed to process the input data in ary [16] is a distributed streaming system. The authors clear, i. e., neither encrypted nor anonymized (green arrow). present a set of algorithms to manage large state, but they Trusted/secure compute nodes are considered trustworthy. blindly spill to disk the state information. Additionally, the Hence, they can process data as a private node but might state is not managed obliviously, hence, the longer the job also be forced to process anonymized data depending on its runs, the more information leaks out. SecureStreams [9] trustworthy degree (orange arrow). Public compute nodes is a lightweight streaming platform running in SGX en- are simply considered insecure. They thus receive enclave- claves. The jobs are build using a set of simple operators encrypted data and process it inside a TEE (red arrow). but the system lacks optimizations (unnecessary encryp- Executing a query in such environments is far from being tion/decryption between each operator) and does not scale- trivial as data might be transferred from one kind of com- out well (inter-operator communication quickly becomes a pute nodes to another. For example, Figure 1 illustrates Figure 1: Overall architecture of our envisioned system: green arrows represent traffic in plain text, orange arrows represent anonymized data, and red arrows represent data encrypted with the enclave key. such data transfers with different arrow colors: while pass- 5.2 State Management ing from orange to green means data de-anonymization, red The state is an essential element of an operator. During to green stands for decryption with the enclave key. At the execution, it is used for storing metadata and intermedi- end of the query, the data is aggregated and sent to output ate results, e. g., a rolling aggregation while scanning a ta- as defined by the query. ble. Handling large state management efficiently for enclave- enabled operators is challenging because of the extremely 5. RESEARCH CHALLENGES small main memory capacity in TEEs (see Section 5.1). Building a system as described in Section 4 comes with Existing systems store large states in the memory out- several research challenges, mainly around data access pri- side of the enclave[5, 16]. However, constant paging, and vacy, efficient query execution, state management, and query thus, data encryption and decryption, imposes great per- optimization. We elaborate on each of these in the following. formance deterioration. For example, TrustedDB [5] uses a custom-built Paging Module that stores all pages outside of 5.1 Efficient query execution the Secure Coprocessor. The pages are pulled on-demand Efficiently executing a query in TEEs is quite challenging as needed by the query processing engine. Thoma et al. because of the extremely small main memory capacity in in [16] propose stateful operators for stream processing us- TEEs, expensive CPU instruction set, and no system calls. ing SGX’s built-in paging mechanism. Yet, both systems are For example, Intel’s proprietary TEE technology (SGX) de- not sufficiently optimized for secure hardware. For example, fines an enclave, a private and highly protected region in in [16], a hash-join operator blindly spills the hash table to memory that cannot be accessed from outside of its process. the outside memory, leading to expensive calls outside the It places the enclave code and data in a special memory enclave whenever it looks for a tuple. area of 128 MB. Yet, excluding space for the SGX metadata, Thus, new state management techniques are required for there is approximately only 90 MB left for the application. settings where the main memory is drastically limited. We As a result, the design of state-of-the-art data processing will investigate new data structures and data indexes that operations (e. g., a join operator) cannot simply be mapped allow for state management outside the enclave while at the to enclave-enabled versions. For example, consider the case same time reducing the unnecessary outside calls. of a hash-join operator. In the build phase, this operator takes the smaller table and builds the hash table for the 5.3 Data Access Privacy selected key. Once the table reaches 90 MB in size, it will Although TEEs (such as SGX) provide secure data pro- start spilling the records to the memory outside the enclave cessing via hardware, it is still possible to have data leak- in an encrypted form. These data spilling operations cause age by understanding the data access patterns done by the expensive calls to the CPU due to the context-switching operators inside the enclave. There have thus been many instructions and costly encryption. proposals on how to achieve data access privacy in systems Therefore, efficiently executing queries in TEEs requires designed for specific use-cases, e. g., homomorphic encryp- a radical change in the design of data processing operators. tion [14] and oblivious algorithms [7, 19]. We will investigate new techniques for cache management However, it is a perpetuating problem how to glue these inside the enclaves and optimizations on CPU instruction proposals together to provide a general-purpose data pro- set specifically for relational algebra. To speed up query cessing engine with oblivious data access and without hurt- execution even further, we will investigate the use of query ing performance. For instance, Opaque [19] provides obliv- compilation techniques to generate highly optimized code ious data processing on SGX, but comes with an overhead for TEEs. Additionally, we will examine parallel enclaves of up to 46x! Similarly, ObliDB [7] comes with a huge over- execution on multi-core CPUs. We will design data process- head as it performs a full table scan for each query to achieve ing operators relying on these new techniques. obliviousness. We will explore different ways of reducing such overhead 8. REFERENCES incurred by current oblivious data access algorithms. Par- [1] AWS Startup Stories. https://aws.amazon.com ticularly, we plan to investigate how to set a lower-bound on /campaigns/aws-startups-stories/. the number of non-relevant tuples that are necessary to hide [2] Microsoft’s Growth ReAzuring Under Nadella. access patterns. Such a lower-bound guarantee is required https://markets.businessinsider.com to not compromise privacy while not harming performance. /news/stocks/microsoft-s-growth-reazuring-under- Another research direction we will explore is to store inside nadella-1028372914. the enclave metadata about table values (similar to Oracle’s [3] Oracle Database Concepts. https://docs.oracle.com. Zone Maps [3]). Having such knowledge ahead of a query [4] P. Antonopoulos, A. Arasu, K. Eguro, J. Hammer, can even prevent scanning a table at all. R. Kaushik, D. Kossmann, R. Ramamurthy, and 5.4 Privacy-Preserving Query Optimization J. Szymaszek. Pushing the limits of encrypted databases with secure hardware. arXiv preprint Intel SGX is notorious for its performance degradation arXiv:1809.02631, 2018. and vulnerability to some attacks [4]. End-to-end encryp- tion and data de/anonymization are known to be compu- [5] S. Bajaj and R. Sion. TrustedDB: A Trusted tationally intensive. Inter-cloud data transfers have been Hardware Based Database with Privacy and Data identified as a bottleneck in cloud computing [18]. These Confidentiality. In SIGMOD, 2011. are some of the aspects the query optimizer will consider [6] C. Bobineau, L. Bouganim, P. Pucheral, and when deciding which data is to be executed on which nodes. P. Valduriez. PicoDBMS: Scaling down database Existing systems reduce the problem space by reducing techniques for the smartcard. In VLDB, 2000. their functionality [7, 13]. However, while VC3 [13] is not [7] S. Eskandarian and M. Zaharia. ObliDB: oblivious using oblivious access and thus it is vulnerable to data access query processing for secure databases. PVLDB, 2019. pattern attacks, ObliDB [7] runs only in an SGX-enabled [8] L. Griebel, H.-U. Prokosch, F. Köpcke, environment. Essentially, all these systems are designed for D. Toddenroth, J. Christoph, I. Leb, I. Engel, and homogeneous environments, i. e., they assume all compute M. Sedlmayr. A scoping review of cloud computing in nodes in the system provide a TEE environment. This is healthcare. BMC Med. Inf. & Decision Making, 2015. not the case in our envisioned system where multifarious [9] A. Havet, R. Pires, P. Felber, M. Pasin, R. Rouvoy, machines with different requirements co-habit together in and V. Schiavoni. Securestreams: A reactive the same system. Enforcing a unified policy across all of middleware framework for secure data stream them would end up in performance degradations. processing. In DEBS, 2017. To solve this challenge, we will study the performance [10] J. S. Karlsson, A. Lal, C. Leung, and T. Pham. IBM of connecting operators with different privacy constraints. DB2 everyplace: A small footprint relational database This will add a new dimension to the query optimization system. In ICDE, 2001. and will force us to rethink the optimization techniques [11] C. Priebe, K. Vaswani, and M. Costa. Enclavedb: A for logical and physical query plans. Moreover, we will in- secure database using SGX. In S&P, 2018. vestigate new techniques for data anonymization and de- [12] M. Sabt, M. Achemlal, and A. Bouabdallah. Trusted anonymization, data processing over encrypted data, and Execution Environment: What It is, and What It is query optimization for heterogeneous secure environments. Not. In Trustcom/BigDataSE/ISPA, 2015. [13] F. Schuster, M. Costa, C. Fournet, C. Gkantsidis, 6. CONCLUSION M. Peinado, G. Mainar-Ruiz, and M. Russinovich. We presented our plan towards a general-purpose secure VC3: Trustworthy data analytics in the cloud using and scalable data processing system for hybrid environ- SGX. In S&P, 2015. ments (i. e., composed of private, trusted, and public com- [14] J. J. Stephen, S. Savvides, V. Sundaram, M. S. pute nodes). We showed that each of the existing systems Ardekani, and P. Eugster. STYX: stream processing solves only a piece of the problem. State-of-the-art solu- with trustworthy cloud-based execution. In SoCC, tions force users to work with sensitive data or computations 2016. only within a private environment, hence underutilizing the [15] S. D. Tetali, M. Lesani, R. Majumdar, and available computing resources. Moreover, some works trade T. Millstein. MrCrypt: Static analysis for secure cloud users’ privacy or drastically limit the use-case, e. g., to a computations. In OOPSLA, 2013. centralized system not suitable for very large datasets. We [16] C. Thoma, A. J. Lee, and A. Labrinidis. Behind showed that we are still missing a system that sees the bigger enemy lines: Exploring trusted data stream processing picture and solves the problem of truly secure data process- on untrusted systems. In CODASPY, 2019. ing at scale. We presented our envisioned system to solve [17] X. Xu and X. Zhao. A framework for privacy-aware such a problem and discussed the main research challenges computing on hybrid clouds with mixed-sensitivity that we must tackle to make our vision a reality. data. In HPCC/CSS/ICESS, 2015. [18] K. Zhang, X. Zhou, Y. Chen, X. Wang, and Y. Ruan. 7. ACKNOWLEDGMENTS Sedic: privacy-aware data intensive computing on This work was funded by the German Ministry for Edu- hybrid clouds. In CCS, 2011. cation and Research as BIFOLD - Berlin Institute for the [19] W. Zheng, A. Dave, J. G. Beekman, R. A. Popa, J. E. Foundations of Learning and Data (ref. 01IS18025A and ref. Gonzalez, and I. Stoica. Opaque: An oblivious and 01IS18037A). encrypted distributed analytics platform. In NSDI, 2017.