<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Minimization through Decentralized Data Architectures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="editor">
          <string-name>Vancouver, Canada</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centrum Wiskunde &amp; Informatica</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this research project, we investigate an alternative to the standard cloud-centralized data architecture. Specifically, we aim to leave part of the application data under the control of the individual data owners in decentralized personal data stores. Our primary goal is to increase data minimization, i.e., enabling more sensitive personal data to be under the control of its owners while providing a straightforward and efficient framework to design architectures that allow applications to run and data to be analyzed. To serve this purpose, the centralized part of the schema contains aggregating views over this decentralized data. We propose to design a declarative language that extends SQL, for architects to specify different kinds of tables and views at the schema level, along with sensitive columns and the minimum granularity level of their aggregations. Local updates need to be reflected in the centralized views while ensuring privacy throughout intermediate calculations; for this we pursue the integration of distributed materialized view maintenance and multi-party computation (MPC) techniques. We finally aim to implement this system, where the personal data stores could either live in mobile devices or encrypted cloud storage, in order to evaluate its performance properties.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Keywords</title>
      <p>distributed systems, cloud, declarative language, multiparty computation</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License</p>
      <p>VLDB 2023 PhD Workshop, co-located with the 49th International</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Organizations almost invariably adopt a data architecture based on centralization for their IT systems, retaining information in analytical stores. Making an organization data-driven, i.e., relying more on data science and machine learning for decision-making, is a driver for expanding data architectures, strengthening the trend of widespread collection of personal data through centralized and typically cloud-based systems. Such systems facilitate data processing; however, they present concerns pertaining to security and confidentiality. The organization is in control of all the user data, causing owners, often private citizens, to lose control over this sensitive data. Full centralisation of detail-level sensitive data increases the exposure of the organization to ransomware attacks, as well as the needed cloud resources.</p>
      <p>In the PhD research plan outlined in this paper, we investigate partially decentralized alternatives to the fully centralized state of affairs, to give users more control over their personal data and reduce processing costs and risks for organizations running applications. We thus aim for a generic infrastructure that pushes the boundaries of data minimization [<xref ref-type="bibr" rid="ref1">1</xref>], i.e., leaving more data under the control of its creators/owners while still allowing easy [...] the next steps in Section 6.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Motivation</title>
      <p>Data minimization is the principle of limiting the data that an organization collects to the minimum needed for the purpose [<xref ref-type="bibr" rid="ref1">1</xref>], and is a crucial concept in privacy regulations such as the GDPR.</p>
      <p>Still, digital services often do need private information, so data minimization in itself does not avoid sensitive user data being collected. However, what is necessary for a purely centralized data architecture might not be needed in the partially decentralized data architectures for which we aim to develop a generic infrastructure.</p>
      <p>A possible use case considers fitness tracker applications, whose popularity has raised concerns regarding the privacy of collected personal data. Such data can include, for example, user profile information, activity statistics, health metrics, and geographical coordinates. We argue that for showing the top-10 runners in a circuit or the distribution of running times, it is not necessary to bring all detail data to the central application database. Our approach allows architects to decentralize sensitive details, such as health metrics and the coordinates of runs, in personal data stores, and only transmit aggregated running data to the central application database. A personal data store is a personal database purely under the user’s control, e.g., a local database kept on a personal device and/or stored in a separate cloud service, but encrypted with a personal key that only the user holds.</p>
      <p>This project aims to reduce sensitive user data storage in central databases without compromising user privacy, providing future application architects with a simple solution without impacting their ability to create compelling applications, while giving end users more control over their personal data. Our research may also contribute to advances in differential privacy.</p>
      <p>We focus on the design of a declarative framework in which information architects can use SQL (i.e., relational database technology, to leverage the analytical properties of structured data) to split a data management architecture between a centralized and a decentralized part, as well as on a secure implementation of this framework in a real system and an evaluation of its efficiency properties. This concept gives users ownership and cryptographic security over their personal data stores, allowing end users to inspect the aggregated queries ordered by the central database.</p>
      <p>A major research question concerns privacy-preserving mechanisms for incremental materialized view maintenance [<xref ref-type="bibr" rid="ref2">2</xref>] in this setting: we want to hide from the central database what each user’s individual contribution to a sensitive materialized aggregate is.</p>
      <p>As a platform for RDDA prototyping, we choose the open-source novel data management system DuckDB [<xref ref-type="bibr" rid="ref3">3</xref>], offering the ability to run analytical queries efficiently even on low-power devices. However, since our framework strongly relies on SQL, it will be easily portable.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Research Questions</title>
      <p>The concepts proposed so far raise a number of research questions:</p>
      <list list-type="order">
        <list-item>
          <p>How can we specify a decentralized data architecture in a declarative language (e.g., a SQL extension), and what properties or constraints should it contain?</p>
        </list-item>
        <list-item>
          <p>How can we apply cryptographic techniques to incrementally maintain materialized views in a manner that does not leak more than understandable constraints stipulate, controlling the amount of privacy leakage?</p>
        </list-item>
        <list-item>
          <p>How can we help establish trust from end users, providing insight and control over personal data and attesting that the service or application implementing the framework plays by its rules?</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-5">
      <title>4. Related Work</title>
      <p>The best-known attempt at creating a unified personal data store is Tim Berners-Lee’s SOLID project, but this does not envision central analytical queries. We think that giving centralized query services access to analytics over whole fleets of personal data stores, via privacy-controlled materialized views, will provide extra value that may help adoption of the concept of personal data stores, which until now has been lackluster.</p>
      <p>Prior query-oriented decentralized infrastructures are mainly found in specific applications, such as real-time cellular network analytics systems exploiting geo-partitioning of input data [<xref ref-type="bibr" rid="ref4">4</xref>]. Distributed and federated query processing are also the subject of extensive research. However, these always assume a free choice in data placement decisions [<xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>], rather than ascertaining that personal data is kept private and under the control of the end user. Federated query processing systems also assume an online mode of operation. Our assumption that only the end users can access their personal data stores leads to an approach where updates from the user side must trickle to a central query-answering facility later, which in turn leads to incremental view maintenance (IVM). IVM has been studied extensively [<xref ref-type="bibr" rid="ref7 ref9">7, 8</xref>], and we aim to build on this work. However, an additional complication is that the incremental maintenance actions may leak information. Therefore, we study mechanisms concerning data processing and supporting cryptographic methods in which updates of multiple users are combined to form more coarse-grained materialized view updates.</p>
      <p>A related research field to protect sensitive data is differential privacy [9], the technique of adding noise to the data to obscure any individual’s information while maintaining the statistical accuracy of the overall set. We think that our decentralized data architectures are a potential use case to build new forms of decentralized differential privacy, and note that research in this area tends to assume a central database. We also note that state-of-the-art differential privacy approaches within database systems such as PINQ [10] and Google DP work best with aggregated data, matching our concept of aggregating materialized views.</p>
      <p>Finally, there is a relevant time dimension: (i) recency of IVM updates will be part of our trade-offs against violating privacy constraints, and (ii) data architects may want to limit the lifetime of data in the materialized views. Therefore, we plan to incorporate specific stream processing elements in our framework. The most relevant research we perceive is the Dataflow model [11], which was the first to clearly separate the concepts of data arrival time and event time in stream processing. We think these notions will be necessary to define accuracy metrics of our materialized views.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Architecture</title>
      <p>Figure 1 shows our decentralized architecture. The infrastructure includes three components. In the first (left), multiple users query and update their personal data stores containing only their private data. The personal data stores reside in encrypted cloud-based storage, whose key belongs to the individual user, and updates are reflected there. The second component (middle) is a secure analytics infrastructure responsible for applying deltas and checking whether the collected data respects privacy constraints without leaking information. The third component (right) is a centralized database.</p>
      <p>Figure 1: The proposed architecture. Decentralized views collect deltas computed from updates, and send aggregated query results to the central server. Before permanently storing the data, privacy checks are performed to assess whether anonymity can be granted.</p>
      <p>Organizations can host online applications using such central databases (as is standard now), as well as run analytical workloads, with the difference that some of this data is stored in materialized views of which the detailed underlying data stems from personal data stores, left under user control. Periodically, upserts from aggregate queries are applied to central materialized views. To maintain privacy guarantees, we establish a minimum granularity level, checking whether each group contains a sufficient number of elements. Privacy rules can be user or system defined: how to choose appropriate bounds is still an open research question. However, this step must be performed without revealing any content before guaranteeing that information is privacy-preserving: a possible technique is multi-party computation [12].</p>
      <p>Our infrastructure then relays these results to the central server, where data analysts can query centralized tables. The framework also periodically transmits central updates to the replicated tables, and general statistics on the completeness of the incrementally maintained views are available to the central database in order to give accuracy bounds on query processing. When individual users of an application, on the other hand, need query processing on their personal data stores or the replicated tables, this involves only the first tier and can be done locally.</p>
      <p>All the components will be specified in a declarative manner: application architects are unlikely to be experts in privacy-conscious decentralized data management. We, therefore, propose our language to be an extension of SQL, hoping to expedite the adoption of our framework.</p>
      <p>Decentralized tables represent personal data stores and can only implement references to other local tables. These can be seen as horizontal partitions (row groups), assigned an implicit identifier at the moment of database initialization. Operations can therefore be executed concurrently on different partitions, allowing for improved performance and smaller transaction scope, similar to the Google Spanner architecture [13] but only requiring explicit declaration during the table creation process.</p>
      <p>Centralized tables, on the other hand, are under the control of the application architect and formally represent the union of aggregations over the partitions. Replicated tables can also be defined, containing overviews to be periodically propagated from the second to the first tier, such as public dashboards.</p>
      <p>We introduce an additional concept of decentralized views, defined over the tables in the personal data store to identify those pieces of data that may be exported centrally. This additional abstraction layer is intended to give end users more insight into what data is centrally readable. Decentralized views contain the records to be communicated to the central entity, which are then stored in centralized views.</p>
      <p>In addition, centralized views may introduce time windows, either in terms of logical (event) time from the data or actual update time. Centralized views can therefore be defined to retain only data from a limited number of such windows. The purpose is to aid information architects in realizing retention limits directly through a SQL specification. This feature also broadens our research question toward incremental streaming view maintenance.</p>
      <p>The column specifications in our table and view definitions will allow, e.g., adding randomized noise to facilitate building differential privacy on top of our framework. They also allow defining sensitive columns, as well as a minimum aggregation granularity, e.g., requiring at least 100 values for an aggregate result tuple that involves the sensitive column before that tuple is included. We aim to develop declarative rules to identify potentially privacy-breaking queries, which may necessitate SQL extensions to provide an additional layer of security.</p>
      <p>The previously described design provides an easy way for application architects to specify the components of our infrastructure; however, it alone does not guarantee to protect data owners from possible malevolence. Aggregate data could still contain sensitive information or not have a sufficient level of granularity, failing to provide anonymization. For example, it would be easy to recognize individuals belonging to a group with only one element. However, such calculations cannot be performed until information from multiple personal data stores (PDS) is obtained.</p>
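      <p>The minimum-granularity rule can be illustrated with a minimal sketch. It assumes a hypothetical threshold <monospace>MIN_GROUP_SIZE</monospace> and a hypothetical helper <monospace>release_groups</monospace>; neither name comes from the proposed framework. The check runs here in plaintext for readability, whereas the architecture requires it to run without revealing group contents (e.g., under MPC).</p>
      <preformat>
```python
from statistics import mean

# Hypothetical threshold: the text mentions e.g. 100 values per aggregate;
# a small value keeps this demonstration readable.
MIN_GROUP_SIZE = 3


def release_groups(rows, min_group_size=MIN_GROUP_SIZE):
    """Aggregate (group, value) rows and keep only sufficiently large groups.

    Returns (released, suppressed): released maps each group to
    (count, average); suppressed lists the groups that were withheld.
    """
    buckets = {}
    for group, value in rows:
        buckets.setdefault(group, []).append(value)

    released, suppressed = {}, []
    for group, values in sorted(buckets.items()):
        if len(values) >= min_group_size:
            released[group] = (len(values), mean(values))
        else:
            # Too few contributors: the aggregate could identify individuals.
            suppressed.append(group)
    return released, suppressed


# Running times in minutes per circuit (illustrative data only).
runs = [
    ("park_loop", 30.0), ("park_loop", 32.0), ("park_loop", 34.0),
    ("river_trail", 41.0),
]
released, suppressed = release_groups(runs)
print(released)    # park_loop is released with its count and average
print(suppressed)  # river_trail has a single runner and is withheld
```
      </preformat>
      <p>In this sketch a group with a single contributor never reaches the released result, which is the situation the one-element example warns about. How to choose the threshold is left open, as in the text.</p>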
      <p>To mitigate the risk of unauthorized access to database records, computations can be offloaded to a third-party entity. Nonetheless, it is crucial to establish a foundation of trust in these additional service providers. A possible solution is to employ 3-way multi-party computation (MPC), either with different cloud providers or a peer-to-peer system, to hide information while it is being processed until it respects our privacy constraints. The state-of-the-art system Secrecy [14] allows secure collaborative analytics through oblivious SQL queries. We plan to extend this framework with IVM techniques to be able to perform bulk updates and insertions.</p>
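      <p>To make the idea of hiding values across three parties concrete, the following is a minimal sketch of additive secret sharing, one standard building block of MPC. It is not the protocol used by Secrecy and not the authors' implementation; <monospace>MODULUS</monospace>, <monospace>share</monospace>, and <monospace>reconstruct</monospace> are names chosen for illustration, and the sketch assumes non-negative integer inputs and servers that do not collude.</p>
      <preformat>
```python
import secrets

# Assumption: all values are non-negative integers below MODULUS.
MODULUS = 2**61 - 1
NUM_SERVERS = 3


def share(value, num_servers=NUM_SERVERS, modulus=MODULUS):
    """Split value into random shares that sum to it modulo modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(num_servers - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares


def reconstruct(shares, modulus=MODULUS):
    """Recombine shares; only meaningful once all servers contribute."""
    return sum(shares) % modulus


# Each user secret-shares a private value (e.g., weekly running minutes).
user_values = [120, 95, 210, 60]
per_user_shares = [share(v) for v in user_values]

# Server i receives the i-th share of every user and adds them locally.
# Taken alone, each share is a uniformly random number.
server_totals = [
    sum(user_shares[i] for user_shares in per_user_shares) % MODULUS
    for i in range(NUM_SERVERS)
]

# Combining the three per-server totals reveals only the aggregate.
total = reconstruct(server_totals)
print(total)  # 485
```
      </preformat>
      <p>Addition on shares is purely local, which is why sums and counts are cheap in this setting; comparisons, such as a group-size threshold check, need additional interactive protocols that this sketch does not cover.</p>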
      <p>Expensive IVM operations such as joins can be performed locally on each personal data store (PDS) in plaintext; results are then applied to decentralized views and sent over a secure communication channel (TLS) to be ultimately stored in centralized views. Our MPC servers, therefore, only need to append new rows or update the aggregated values in single tables, which can be performed through cheap oblivious arithmetic operations.</p>
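      <p>This division of labour can be sketched as follows: each personal data store pre-aggregates its own new rows into a delta, and the central side only appends new groups or adds to existing aggregate values. The sketch works on plaintext and omits the oblivious (MPC) execution entirely; <monospace>local_delta</monospace> and <monospace>apply_delta</monospace> are hypothetical names, not part of the proposed system.</p>
      <preformat>
```python
from collections import defaultdict


def local_delta(new_runs):
    """On a personal data store: fold new rows into per-group (count, sum)."""
    delta = defaultdict(lambda: [0, 0.0])
    for circuit, minutes in new_runs:
        delta[circuit][0] += 1
        delta[circuit][1] += minutes
    return {circuit: tuple(agg) for circuit, agg in delta.items()}


def apply_delta(view, delta):
    """Centrally: append new groups or add to existing aggregates."""
    for circuit, (count, total) in delta.items():
        old_count, old_total = view.get(circuit, (0, 0.0))
        view[circuit] = (old_count + count, old_total + total)
    return view


# Centralized view: circuit -> (number of runs, total minutes).
central_view = {"park_loop": (2, 62.0)}

# Two users report deltas; the raw rows never leave their stores.
delta_a = local_delta([("park_loop", 34.0), ("river_trail", 41.0)])
delta_b = local_delta([("river_trail", 39.0)])

for delta in (delta_a, delta_b):
    apply_delta(central_view, delta)

print(central_view)
```
      </preformat>
      <p>Keeping count and sum rather than an average means each delta is applied with additions only, and the average can be derived when the view is queried. In this plaintext form a single user's delta is still visible to whoever applies it, so hiding individual contributions additionally requires combining the deltas of several users, as the text describes.</p>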
      <p>Establishing trust in the organization responsible for setting up our infrastructure, ensuring they fulfill their claims, remains a prerequisite in this methodology. This can be achieved through various approaches, including enabling transparency by exposing all server traffic and resource utilization, conducting audits, and using open-source technologies. Our exploration of efficient encryption within incremental view maintenance is ongoing, and we remain open to additional ideas that could enhance our approach.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pfitzmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <article-title>A terminology for talking about privacy by data minimization: Anonymity, unlinkability, undetectability, unobservability, pseudonymity, and identity management</article-title>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Mumick</surname>
          </string-name>
          ,
          <article-title>Maintenance of materialized views: Problems, techniques, and applications</article-title>
          ,
          <source>IEEE Data Eng. Bull</source>
          .
          <volume>18</volume>
          (
          <year>1995</year>
          )
          <fpage>3</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Raasveldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mühleisen</surname>
          </string-name>
          ,
          <article-title>DuckDB: an embeddable analytical database</article-title>
          ,
          <source>in: Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference</source>
          <year>2019</year>
          , Amsterdam, The Netherlands, June 30 - July 5,
          <year>2019</year>
          , ACM,
          <year>2019</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>A. P. I.</surname>
          </string-name>
          et al.,
          <article-title>CellIQ: Real-time cellular network analytics at scale</article-title>
          ,
          <source>in: 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 15, Oakland, CA, USA, May 4-6, 2015</source>
          , USENIX Association,
          <year>2015</year>
          , pp.
          <fpage>309</fpage>
          -
          <lpage>322</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Raasveldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mühleisen</surname>
          </string-name>
          ,
          <article-title>MonetDBLite: An embedded analytical database</article-title>
          , CoRR abs/1805.08520 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1805.08520. arXiv:1805.08520.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mühleisen</surname>
          </string-name>
          ,
          <article-title>Architecture-independent distributed query processing</article-title>
          ,
          <source>Ph.D. thesis</source>
          , Free University of Berlin,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Y. A.</surname>
          </string-name>
          et al.,
          <article-title>DBToaster: Higher-order delta processing for dynamic, frequently fresh views</article-title>
          ,
          <year>2012</year>
          . arXiv:1207.0137.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <surname>M. B.</surname>
          </string-name>
          et al.,
          <article-title>DBSP: Automatic incremental view maintenance for rich query languages</article-title>
          ,
          <year>2022</year>
          . arXiv:2203.16684.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>