
Minimization through Decentralized Data Architectures

Vancouver, Canada

Centrum Wiskunde & Informatica, Amsterdam, The Netherlands

In this research project, we investigate an alternative to the standard cloud-centralized data architecture. Specifically, we aim to leave part of the application data under the control of the individual data owners in decentralized personal data stores. Our primary goal is to increase data minimization, i.e., enabling more sensitive personal data to be under the control of its owners while providing a straightforward and efficient framework to design architectures that allow applications to run and data to be analyzed. To serve this purpose, the centralized part of the schema contains aggregating views over this decentralized data. We propose to design a declarative language that extends SQL, for architects to specify different kinds of tables and views at the schema level, along with sensitive columns and the minimum granularity level of their aggregations. Local updates need to be reflected in the centralized views while ensuring privacy throughout intermediate calculations; for this we pursue the integration of distributed materialized view maintenance and multi-party computation (MPC) techniques. Finally, we aim to implement this system, where the personal data stores could live either on mobile devices or in encrypted cloud storage, in order to evaluate its performance properties.

distributed systems, cloud, declarative language, multiparty computation

© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License

1. Introduction

Organizations almost invariably adopt a data architecture based on centralization for their IT systems, retaining information in analytical stores. Making an organization data-driven, i.e., relying more on data science and machine learning for decision-making, is a driver for expanding data architectures, strengthening the trend of widespread collection of personal data through centralized and typically cloud-based systems. Such systems facilitate data processing; however, they raise concerns about security and confidentiality. The organization is in control of all the user data, causing owners, often private citizens, to lose control over this sensitive data. Full centralisation of detail-level sensitive data increases the exposure of the organization to ransomware attacks, as well as the needed cloud resources.

In the PhD research plan outlined in this paper, we investigate partially decentralized alternatives to the fully centralized state of affairs, to give users more control over their personal data and reduce processing costs and risks for organizations running applications. We thus aim for a generic infrastructure that pushes the boundaries of data minimization [1], i.e., leaving more data under the control of its creators/owners while still allowing easy

VLDB 2023 PhD Workshop, co-located with the 49th International

2. Motivation

Data minimization is the principle of limiting the data that an organization collects to the minimum needed for the purpose [1], and is a crucial concept in privacy regulations such as the GDPR. Still, digital services often do need private information, so data minimization in itself does not avoid sensitive user data being collected. However, what is necessary for a purely centralized data architecture might not be needed in the partially decentralized data architectures for which we aim to develop a generic infrastructure.

A possible use case considers fitness tracker applications, whose popularity has raised concerns regarding the privacy of collected personal data. For example, this can include user profile information, activity statistics, health metrics, and geographical coordinates. We argue that for showing the top-10 runners in a circuit or the distribution of running times, it is not necessary to bring all detail data to the central application database. Our approach allows decentralizing sensitive details, such as health metrics and the coordinates of runs, in personal data stores, and only transmitting aggregated running data to the central application database. A personal data store is a personal database purely under the user’s control, e.g., a local database kept on a personal device, and/or stored in a separate cloud service, but encrypted with a personal key that only the user holds.

This project aims to reduce sensitive user data storage in central databases without compromising user privacy, providing future application architects with a simple solution without impacting their ability to create compelling applications, while giving end users more control over their personal data. Our research may also contribute to advances in differential privacy.

We focus on the design of a declarative framework in which information architects can use SQL (i.e., relational database technology to leverage the analytical properties of structured data) to split a data management architecture between a centralized and decentralized part, as well as on a secure implementation of this framework in a real system, and an evaluation of its efficiency properties. This concept gives users ownership and cryptographic security over their personal data stores, allowing an end-user inspection of the aggregated queries ordered by the central database.

A major research question concerns privacy-preserving mechanisms for incremental materialized view maintenance [2] in this setting: we want to hide from the central database what each user’s individual contribution to a sensitive materialized aggregate is.

As a platform for RDDA prototyping, we choose the open-source novel data management system DuckDB [3], offering the ability to run analytical queries efficiently even on low-power devices. However, since our framework strongly relies on SQL, it will be easily portable.

3. Research Questions

The concepts proposed so far raise a number of research questions:

1. How can we specify a decentralized data architecture in a declarative language (e.g., a SQL extension), and what properties or constraints should it contain?

2. How can we apply cryptographic techniques to incrementally maintain materialized views in a manner that does not leak more than understandable constraints stipulate, controlling the amount of privacy leakage?

3. How can we help establish trust from end users, providing insight and control over personal data and attesting that the service or application implementing the framework plays by its rules?

The best-known attempt at creating a unified personal data store is Tim Berners Lee’s SOLID project, but this does not envision central analytical queries. We think that giving centralized query services access to analytics over whole fleets of personal data stores, via privacy-controlled materialized views, will provide extra value that may help adoption of the concept of personal data stores, which until now has been lackluster.

Prior query-oriented decentralized infrastructures are mainly found in specific applications, such as real-time cellular network analytics systems exploiting geo-partitioning of input data [4]. Distributed and federated query processing are also the subject of extensive research. However, these always assume a free choice in data placement decisions [5, 6], rather than ascertaining that personal data is kept private and under the control of the end user. Federated query processing systems also assume an online mode of operation. Our assumption that only the end users can access their personal data stores leads to an approach where updates from the user side must trickle to a central query-answering facility later, which in turn leads to incremental view maintenance (IVM). IVM has been studied extensively [7, 8], and we aim to build on this work. However, an additional complication is that the incremental maintenance actions may leak information. Therefore, we study mechanisms concerning data processing and supporting cryptographic methods in which updates of multiple users are combined to form more coarse-grained materialized view updates.

A related research field to protect sensitive data is differential privacy [9], the technique of adding noise to the data to obscure any individual’s information while maintaining the statistical accuracy of the overall set. We think that our decentralized data architectures are a potential use case to build new forms of decentralized differential privacy, and note that research in this area tends to assume a central database. We also note that state-of-the-art differential privacy approaches within database systems such as Pinq [10] and Google DP work best with aggregated data, matching our concept of aggregating materialized views.

Finally, there is a relevant time dimension: (i) recency of IVM updates will be part of our trade-offs against violating privacy constraints, and (ii) data architects may want to limit the lifetime of data in the materialized views. Therefore, we plan to incorporate specific stream processing elements in our framework. The most relevant research we perceive is the Dataflow model [11], which was the first to clearly separate the concepts of data arrival time and event time in stream processing. We think these notions will be necessary to define accuracy metrics of our materialized views.

5. Architecture

Figure 1 shows our decentralized architecture. The infrastructure includes three components: in the first (left), multiple users query and update their personal data stores containing only their private data. The personal data stores reside in encrypted cloud-based storage, whose key belongs to the individual user, and updates are reflected there. The second component (middle) is a secure analytics infrastructure responsible for applying deltas and checking whether the collected data respects privacy constraints without leaking information. The third component (right) is a centralized database.

Figure 1: The proposed architecture. Decentralized views collect deltas computed from updates, and send aggregated query results to the central server. Before permanently storing the data, privacy checks are performed to assess whether anonymity can be granted.

Organizations can host online applications using such central databases (as is standard now), as well as run analytical workloads, with the difference that some of this data is stored in materialized views of which the detailed underlying data stems from personal data stores, left under user control. Periodically, upserts from aggregate queries are applied to central materialized views. To maintain privacy guarantees, we establish a minimum granularity level, checking whether each group contains a sufficient number of elements. Privacy rules can be user or system defined: how to choose appropriate bounds is still an open research question. However, this step must be performed without revealing any content before guaranteeing that information is privacy-preserving: a possible technique is multi-party computation [12].

Our infrastructure then relays these results to the central server, where data analysts can query centralized tables. The framework also periodically transmits central updates to the replicated tables, and general statistics on the completeness of the incrementally maintained views are available to the central database in order to give accuracy bounds on query processing. When individual users of an application, on the other hand, need query processing on their personal data stores or the replicated tables, this involves only the first tier and can be done locally.

All the components will be specified in a declarative manner: application architects are unlikely to be experts in privacy-conscious decentralized data management. We, therefore, propose our language to be an extension of SQL, hoping to expedite the adoption of our framework.

Decentralized tables represent personal data stores and can only implement references to other local tables. These can be seen as horizontal partitions (row groups), assigned an implicit identifier at the moment of database initialization. Operations can therefore be executed concurrently on different partitions, allowing for improved performance and smaller transaction scope, similar to the Google Spanner architecture [13] but only requiring explicit declaration during the table creation process.

Centralized tables, on the other hand, are under the control of the application architect and formally represent the union of aggregations over the partitions. Replicated tables can also be defined, containing overviews to be periodically propagated from the second to the first tier, such as public dashboards.

We introduce an additional concept of decentralized views, defined over the tables in the personal data store to identify those pieces of data that may be exported centrally. This additional abstraction layer is intended to give end users more insight into what data is centrally readable. Decentralized views contain the records to be communicated to the central entity, which are then stored in centralized views.

In addition, centralized views may introduce time windows, either in terms of logical (event) time from the data or actual update time. Centralized views can therefore be defined to retain only data from a limited number of such windows. The purpose is to aid information architects in realizing retention limits directly through a SQL specification. This feature also broadens our research question toward incremental streaming view maintenance.

The column specifications in our table and view definitions will allow, e.g., adding randomized noise to facilitate building differential privacy on top of our framework. They also allow defining sensitive columns, as well as minimum aggregation granularity, such as a minimum of, e.g., 100 values for an aggregate result tuple that involves this sensitive column to be included in it. We aim to develop declarative rules to identify potentially privacy-breaking queries, which may necessitate SQL extensions to provide an additional layer of security.

The previously described design provides an easy way for application architects to specify the components of our infrastructure; however, it alone does not guarantee protection of data owners from possible malevolence. Aggregate data could still contain sensitive information or not have a sufficient level of granularity, failing to provide anonymization. For example, it would be easy to recognize individuals belonging to a group with only one element. However, such calculations cannot be performed until information from multiple personal data stores (PDS) is obtained.

In order to mitigate the potential risk of unauthorized access to database records, the offloading of computations to a third-party entity can be considered. Nonetheless, it is crucial to establish a foundation of trust in these additional service providers. A possible solution is to employ 3-way multi-party computation (MPC), either with different cloud providers or a peer-to-peer system, to hide information while it is being processed until it respects our privacy constraints. The state-of-the-art system Secrecy [14] allows secure collaborative analytics through oblivious SQL queries. We plan to extend this framework with IVM techniques to be able to perform bulk updates and insertions.
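To give an intuition for how three non-colluding servers can compute an aggregate without any of them seeing an individual contribution, the sketch below uses textbook additive secret sharing. This is an illustrative assumption only: it is not the protocol or API of Secrecy, and the modulus, function names, and sample values are hypothetical. Values are assumed to be non-negative integers smaller than the modulus.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus; all arithmetic is done modulo this value


def share(value, n_parties=3):
    """Split value into n additive shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    """Recombine shares; only meaningful when all shares are present."""
    return sum(shares) % MODULUS


# Each user splits a private value (e.g. a run duration in seconds) into 3 shares.
user_values = [1500, 1620, 1380]
per_user_shares = [share(v) for v in user_values]

# Server i only ever receives the i-th share of every user and adds them locally.
server_totals = [
    sum(user_shares[i] for user_shares in per_user_shares) % MODULUS
    for i in range(3)
]

# Combining the three server totals reveals only the aggregate, not the inputs.
total = reconstruct(server_totals)
```

Each single share is uniformly random, so one server alone learns nothing about a user's value; only the recombined total (here the sum of the three durations) becomes visible.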

Expensive IVM operations such as joins can be performed locally on PDS in plain text; results are then applied to decentralized views and sent over a secure communication channel (TLS) to be ultimately stored in centralized views. Our MPC servers, therefore, only need to append new rows or update the aggregated values in single tables, which can be performed through cheap oblivious arithmetic operations.
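The division of labour described above, with deltas computed locally and only cheap additions applied centrally, can be sketched as follows. This is a plain-text illustration under stated assumptions: the view is a hypothetical per-circuit (count, sum) aggregate, the function names are invented for this example, and in the proposed system the application step would be carried out obliviously by the MPC servers rather than on readable values.

```python
from collections import defaultdict


def local_delta(inserted, deleted):
    """Compute per-group (count, sum) changes from the rows inserted into
    and deleted from one personal data store."""
    delta = defaultdict(lambda: [0, 0.0])
    for circuit, seconds in inserted:
        delta[circuit][0] += 1
        delta[circuit][1] += seconds
    for circuit, seconds in deleted:
        delta[circuit][0] -= 1
        delta[circuit][1] -= seconds
    return {group: tuple(change) for group, change in delta.items()}


def apply_delta(view, delta):
    """Upsert a delta into the aggregate view using additions only;
    groups whose count drops to zero are removed."""
    for group, (d_count, d_sum) in delta.items():
        count, total = view.get(group, (0, 0.0))
        count, total = count + d_count, total + d_sum
        if count == 0:
            view.pop(group, None)
        else:
            view[group] = (count, total)
    return view


# Central view state before the update: group -> (count, sum of seconds).
central_view = {"park": (2, 3120.0)}

# One user records two new runs; only the aggregated delta leaves the device.
delta = local_delta(inserted=[("park", 1380.0), ("harbour", 2000.0)], deleted=[])
apply_delta(central_view, delta)
```

Keeping (count, sum) rather than the average makes the view maintainable by addition alone, which is the property the MPC servers rely on; the average can be derived when the view is queried.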

Establishing trust in the organization responsible for setting up our infrastructure, ensuring they fulfill their claims, remains a prerequisite in this methodology. This can be achieved through various approaches, including enabling transparency by exposing all server traffic and resource utilization, conducting audits, and using open-source technologies. However, our exploration of efficient encryption within incremental view maintenance is ongoing, and we remain open to additional ideas that could enhance our approach.

[1] Pfitzmann, Hansen, A terminology for talking about privacy by data minimization: Anonymity, unlinkability, undetectability, unobservability, pseudonymity, and identity management, 2009.

[2] A. K. Gupta, I. S. Mumick, Maintenance of materialized views: Problems, techniques, and applications, IEEE Data Eng. Bull. 18 (1999) 3-18.

[3] Raasveldt, Mühleisen, DuckDB: an embeddable analytical database, in: Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, ACM, 2019, pp.

[4] A. P. I. et al., CellIQ: Real-time cellular network analytics at scale, in: 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 15, Oakland, CA, USA, May 4-6, 2015, USENIX Association, 2015, pp. 309-322.

[5] Raasveldt, Mühleisen, MonetDBLite: An embedded analytical database, CoRR abs/1805.08520 (2018). URL: http://arxiv.org/abs/1805.08520. arXiv:1805.08520.

[6] Mühleisen, Architecture-independent distributed query processing, Ph.D. thesis, Free University of Berlin, 2012.

[7] Y. A. et al., DBToaster: Higher-order delta processing for dynamic, frequently fresh views, 2012. arXiv:1207.0137.

[8] M. B. et al., DBSP: Automatic incremental view maintenance for rich query languages, 2022. arXiv:2203.16684.