<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enabling Privacy-Preserving Data Aggregation in Networks of Personal Data Management Systems (Extended Abstract)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Julien Mirval</string-name>
          <email>julien.mirval@cozycloud.cc</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iulian Sandu-Popa</string-name>
          <email>iulian.sandu-popa@uvsq.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luc Bouganim</string-name>
          <email>luc.bouganim@inria.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Tran-van</string-name>
          <email>paul@cozycloud.cc</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cozy Cloud, “Le Surena”, face au 5 Quai Marcel Dassault</institution>
          ,
          <addr-line>92150 Suresnes</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>The development and adoption of Personal Data Management Systems (PDMSs) have been fueled by technical means, privacy regulations, and smart disclosure initiatives. A PDMS makes it easier for users to collect, process, and share personal data. However, functionalities based on collective computations within a network of PDMSs are still lacking, at least in commercial products. This demonstration bridges this gap by leveraging the open-source Cozy Cloud product and recent research results in the area of privacy-preserving decentralized data aggregation. Our demonstration scenario highlights both the utility of collective computations and the main features of the aggregation protocol.</p>
      </abstract>
      <kwd-group>
        <kwd>Personal data management systems</kwd>
        <kwd>Secure aggregation</kwd>
        <kwd>Peer-to-peer</kwd>
        <kwd>Federated learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        New privacy-protection regulations (e.g., GDPR) and smart disclosure initiatives in the last
decade have boosted the development and adoption of Personal Data Management Systems
(PDMSs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A PDMS (e.g., Cozy [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Nextcloud, Solid) is a data platform that allows users
to easily collect, store, and manage in a single place both the data directly generated by their
devices (e.g., quantified-self data, smart home data, photos) and the data resulting from their
interactions (e.g., social interaction data, health, bank, telecom). Users can then leverage the
power of their PDMS to benefit from their personal data for their own good and for the benefit
of the community [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The ambition of existing PDMSs is to offer functionalities covering
all the major steps in the data life-cycle [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]: (i) data backup and storage; (ii) data collection via
connectors to the typical online services holding user data (e.g., bank, telecom, shopping, social
networks, email); (iii) data sharing between a user’s devices or between different users’ PDMSs;
and (iv) advanced personal computations allowing users to cross-reference their data from different data
silos (e.g., health records and physical activity data).
      </p>
      <p>
        However, the PDMS paradigm leads to a shift in the personal data ecosystem since data
becomes massively distributed on the user side. To unlock innovative usages, individuals can
leverage their PDMSs by forming large communities of users sharing their data. This makes it
possible, for example, to compute statistics for epidemiological studies or to train a Machine Learning
(ML) model for recommendation or classification systems. These usages, however, introduce
new security and performance issues, as evidenced by the large body of recent work in this
area [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. To enable such new usages in the PDMS context, we need new protocols adapted to
its specificities. These protocols need to protect user privacy and adapt to varying selectivity
(i.e., how many relevant participants consent to contribute). Ideally, the proposed protocol should provide an
accurate result that takes advantage of the high-quality data available in PDMSs. Efficiency
(i.e., protocol latency and total load of the system) is of prime importance given the potentially
limited communication speed or computation power of PDMSs. Finally, given the scale of such
decentralized aggregation, protocols must also be robust to node dropouts.
      </p>
      <p>
        Ensuring all these properties together is challenging, which might explain why existing
commercial PDMS solutions lack functionalities implementing collective computations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ];
privacy-preserving data aggregation in a network of PDMSs has mostly been the focus of
research works and prototypes. This demonstration is a first step towards bringing
commercial PDMS solutions closer to recent academic results on privacy-preserving data
aggregation in a network of PDMSs. Specifically, the main contribution of this demonstration is to
integrate into an existing, classical, open-source PDMS solution, i.e., Cozy (see Section 2), a
privacy-preserving, scalable, and adaptive aggregation protocol (see Section 3) leveraging our
recent research results [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. Thus, our demonstration shows that privacy-preserving collective
computations can be enabled in the PDMS context, allowing users to benefit, through collective
data sharing, from the diverse, abundant, and high-quality data stored in their PDMSs (e.g., by
collectively training an ML classifier). In addition to a walk through the classical functionalities
of the Cozy Cloud PDMS, our demonstration scenario (see Section 4) highlights both the utility
of collective computations and the main features of the aggregation protocol.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. The Cozy Platform</title>
      <p>
        Cozy Cloud [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has been developing a PDMS platform for over a decade with the objective of
providing a “smart digital home” that combines the comfort of having all your data stored and
processed in a single place with the virtues of a reproducible open-source environment. Below,
we describe the main features of Cozy.
      </p>
      <p>Data collection. Cozy enables effortless automatic data collection through connectors which
fetch or scrape data from external service providers. Connectors are open-source and easy for
independent developers to create (e.g., there are more than 150 existing connectors, most of
them developed by the community).</p>
      <p>Data sharing. Data can be shared selectively with other users thanks to fine-grained control
over access groups and permissions. It can also be shared across the user’s devices to enable
accessing data locally, even during periods of no or low connectivity.</p>
      <p>Cross-domain local computations. Users can install apps and services that use their data
locally. Upon installation, users are presented with a detailed summary of the data the app
requires and the related purposes so that the users can give their informed consent. These apps
enable cross-domain computations to benefit from the variety of user data and provide useful
services and interesting analytics. An example is to compute your carbon footprint based on
mobility traces as well as home energy consumption data.</p>
      <p>Collective computations. In addition, these apps and services can also be used to coordinate
calculations between users. This can be used, for example, to anonymously compare previously
calculated carbon footprints with those of users around you. However, distributed computations
introduce a whole range of new security and privacy risks that are not, or poorly, addressed by
current PDMS solutions.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Privacy-Preserving Decentralized Data Aggregation</title>
      <p>
        This section summarizes the main design principles proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to fulfill the privacy,
accuracy, and adaptability properties. Before going into the aggregation protocol details, we
briefly introduce the considered computation and threat models.
      </p>
      <p>
        Computation model. This demo focuses on aggregation primitives, which are essential to
compute basic statistics on user data and are also a fundamental building block for ML algorithms.
A model computation can be triggered by any node in a PDMS network, called the querier. The
querier broadcasts the computation and each node decides whether it consents to contribute; a
consenting node is called a contributor. Each node (contributor or not) may also act as a data
processor and is then called an aggregator. Each contributor trains the model locally for several epochs as
described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and sends it to its parent aggregators. Achieving a scalable aggregation process
requires multiple aggregators, naturally arranged in a tree structure (see Fig. 1.a) wherein the
intermediary nodes are aggregators and the leaves are contributors. The querier obtains the
result from the tree root.
      </p>
      <p>
        Threat model. As in the majority of secure aggregation (SA) works [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we consider the
classical honest-but-curious threat model, i.e., an attacker can access, but cannot alter, the data
manipulated by the attacked nodes (called leaking nodes). A PDMS can hold the entire digital life
of its owner and thus needs to be highly protected against privacy threats, as indicated by recent
works [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, we consider that some PDMS owners have succeeded in tampering with
their PDMS, since no security measure is unbreakable. Since attackers may collude and thus,
de facto, control more than one PDMS, the worst-case attack is represented by the maximum
number of colluding nodes controlled by a single “attacker”, i.e., $c$ leaking nodes.
      </p>
      <p>
        Privacy and accuracy: We use a secret sharing scheme without threshold for data
confidentiality. Each contributor splits its private value into $s$ shares, which leads to building $s$
separate (parallel) aggregation trees with exactly the same structure. This makes it impossible
to reconstruct the secret unless someone collects all $s$ shares, and thus precludes inferences by an
attacker on any of the intermediate results (see Fig. 1.b). The $i$-th share has the value
$v_i = v/s + r_i$ such that $\sum_{i=1}^{s} r_i = 0$, where $v$ is the private value. Thus, the shares
are aggregated separately in the $s$ trees and, if no share is missing (reliability is discussed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]), the final
result equals the exact sum of all private values, which is computed by the querier. Hence, our
protocol provides, by construction, accurate results. The number of shares, $s$, is computed such
that the probability for an attacker controlling $c$ nodes to obtain all $s$ shares is lower than $\alpha$, a
security threshold (e.g., $\alpha = 10^{-6}$). Consequently, considering a uniform node distribution [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
the probability that an attacker controls an entire group is given by $(c/N)^s &lt; \alpha$, where $N$ is
the total number of nodes in the network. Then $s$ is
minimal when $s = \lceil \log(\alpha) / \log(c/N) \rceil$.
      </p>
      <p>Figure 1 (Q: querier, A: aggregator, C: contributor): (a) Aggregation tree; (b) Privacy preservation; (c) Adaptability; (d) Scalability: recursive building of the privacy-preserving aggregation tree.</p>
      <p>
        Adaptability: The number of aggregators and their arrangement (i.e., the tree fan-out and
its height) are tuned as a function of the number of contributors, the communication costs,
and the processing costs, as discussed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This allows the protocol to always offer near-optimal
performance (i.e., aggregation latency) and achieve adaptability w.r.t. the computation
selectivity and PDMS characteristics. Furthermore, our protocol can be configured to offer
the desired trade-off between the latency and the total cost of the aggregation, which are
conflicting objectives: at one extreme, a binary aggregation tree maximally distributes the
load but increases the latency; at the other extreme, all contributors concentrate the load on a
single aggregator group (see Fig. 1.c).
      </p>
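      <p>As a rough illustration of the latency versus load trade-off discussed above, the following sketch counts, for a given fan-out, the number of aggregation levels (a proxy for latency) and the number of aggregator groups (a proxy for how widely the load is spread). The function name tree_shape and this purely structural model are assumptions made for illustration; the actual cost model is the one discussed in [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
      <preformat><![CDATA[
```python
def tree_shape(n_contributors: int, fan_out: int) -> dict:
    """Shape of an aggregation tree built bottom-up over the contributors.

    Hypothetical helper: every aggregator group has at most `fan_out`
    children; contributors are the leaves and are not counted as levels.
    """
    if n_contributors < 1 or fan_out < 2:
        raise ValueError("need at least 1 contributor and a fan-out of 2 or more")

    levels = 0        # aggregation levels crossed before reaching the root
    aggregators = 0   # total number of aggregator groups in the tree
    width = n_contributors
    while True:
        width = -(-width // fan_out)  # ceiling division: parents needed above
        levels += 1
        aggregators += width
        if width == 1:                # the root has been reached
            break

    return {
        "levels": levels,
        "aggregators": aggregators,
        # most inputs a single aggregator group has to receive and sum
        "max_children": min(fan_out, n_contributors),
    }


if __name__ == "__main__":
    for f in (2, 4, 8):
        print(f, tree_shape(8, f))
```
]]></preformat>
      <p>For 8 contributors, a fan-out of 2 yields 3 levels and 7 aggregator groups, whereas a fan-out of 8 yields a single level handled by one aggregator group that receives all 8 inputs.</p>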
      <p>
        For the sake of brevity, we omit the details on the scalability property (see Fig. 1.d and details
in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) and also the thorny issue of reliability (see [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]).
      </p>
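      <p>To make the secret sharing scheme of this section more concrete, here is a minimal sketch under simplifying assumptions: private values are exact numbers, each of the s parallel trees is reduced to a plain sum, and the masks come from Python's random module, which is not cryptographically secure. The names split_into_shares, aggregate, and min_shares, as well as the symbol n for the total number of nodes, are hypothetical and not taken from the Cozy code base.</p>
      <preformat><![CDATA[
```python
import math
import random
from fractions import Fraction


def min_shares(alpha: float, c: int, n: int) -> int:
    """Number of shares s = ceil(log(alpha) / log(c / n)).

    c is the number of colluding nodes and n the total number of nodes
    (assumed symbols). Floating-point rounding may matter when the ratio
    is an exact integer.
    """
    if not (0 < c < n) or not (0 < alpha < 1):
        raise ValueError("need 0 < c < n and 0 < alpha < 1")
    return math.ceil(math.log(alpha) / math.log(c / n))


def split_into_shares(value, s: int, rng: random.Random) -> list:
    """Split `value` into s shares v/s + r_i whose masks r_i sum to zero."""
    if s < 1:
        raise ValueError("need at least one share")
    masks = [rng.randrange(-10**9, 10**9) for _ in range(s - 1)]
    masks.append(-sum(masks))             # forces the masks to sum to 0
    base = Fraction(value) / s            # exact arithmetic, no rounding
    return [base + r for r in masks]


def aggregate(private_values, s: int, seed: int = 0):
    """Sum private values through s separate share-wise aggregations."""
    rng = random.Random(seed)             # illustrative only, not secure
    all_shares = [split_into_shares(v, s, rng) for v in private_values]
    # Tree i only ever sees the i-th share of each contributor.
    per_tree_totals = [sum(shares[i] for shares in all_shares)
                       for i in range(s)]
    # The querier recombines the s partial totals into the exact sum.
    return sum(per_tree_totals)


if __name__ == "__main__":
    s = min_shares(alpha=1e-6, c=50, n=1000)
    print(s, aggregate([3, 10, 7], s))
```
]]></preformat>
      <p>With a threshold of 10^-6, 50 colluding nodes, and 1000 nodes in total, min_shares returns 5, since 0.05^5 is about 3.1e-7 and falls below the threshold while 0.05^4 is about 6.3e-6 and does not. The aggregated result equals the plain sum of the private values because the masks of each contributor cancel out.</p>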
    </sec>
    <sec id="sec-4">
      <title>4. Demonstration</title>
      <p>Our scenario will give a quick tour of Cozy’s functionalities and will mainly show the benefits
and feasibility of collective computations for ML applications in the PDMS context. Our
demonstration software is built as an app on Cozy’s platform and uses a set of connected PDMSs
that store banking operations to classify. A detailed explanation of the installation and usage
of the demo software can be found in a public repository (https://github.com/cozy/dissec_cozy/blob/master/DEMONSTRATION.md)
or in a video (https://clipchamp.com/watch/7vsaYZyPEDa). We eventually aim to
provide this app to real end-users of Cozy.</p>
      <p>For the purpose of the demonstration, local PDMS instances, one of which is the querier, are
created and populated with samples of test data (i.e., banking operations). A web interface
displays all of an instance’s banking operations. Each PDMS instance has some operations
that are already classified, but it does not have enough classified data to train an accurate ML
model. In fact, the locally trained model may introduce too many classification errors that could
confuse the user, so it is safer to leave the corresponding operations unclassified (see Fig. 2 left).</p>
      <p>From the demonstration platform’s interface, we can trigger the computation of a collectively
trained model leveraging a specific aggregation tree (see Section 3). The platform assigns
instances to the tree to facilitate tuning the tree structure. The aggregation process is visible
in real time (see Fig. 2 center) as the assigned nodes start processing data and transmitting
intermediate results, starting from the contributors at the bottom and up to the querier at the top,
which recomposes the global model.</p>
      <p>Once the global model is available, we can rerun the classification of the banking operations.
We observe that this model can effectively classify all the banking operations (see Fig. 2 right).
The demonstration interface also allows the audience to understand the properties of the
aggregation protocol, in particular the computation security, i.e., an attacker controlling a
subset of the PDMS instances cannot gain any knowledge about the private data of other nodes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Anciaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bouganim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , et al.,
          <article-title>Personal Data Management Systems: The Security and Functionality Standpoint</article-title>
          ,
          <source>Information Systems</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>Cozy Cloud</string-name>
          , Cozy Cloud (see https://cozy.io/fr/),
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>European Commission</string-name>
          ,
          <article-title>Proposal for a Regulation on European Data Governance (Data Governance Act)</article-title>
          , COM/
          <year>2020</year>
          /767. [eur-lex], 25 October
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mansouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Önen</surname>
          </string-name>
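          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Jaballah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Conti</surname>
          </string-name>
          ,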
          <article-title>SoK: Secure Aggregation Based on Cryptographic Schemes for Federated Learning</article-title>
          ,
          <source>PETS</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mirval</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bouganim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sandu-Popa</surname>
          </string-name>
          ,
          <article-title>Practical Fully-Decentralized Secure Aggregation for Personal Data Management Systems</article-title>
          , in: SSDBM,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mirval</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bouganim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sandu-Popa</surname>
          </string-name>
          ,
          <article-title>Federated learning on personal data management systems: Decentralized and reliable secure aggregation protocols</article-title>
          ,
          in: <source>SSDBM</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>McMahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hampson</surname>
          </string-name>
          , et al.,
          <article-title>Communication-Efficient Learning of Deep Networks from Decentralized Data</article-title>
          ,
          <source>PMLR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Anciaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bouganim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pucheral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sandu-Popa</surname>
          </string-name>
          , et al.,
          <article-title>Personal Database Security and Trusted Execution Environments: A Tutorial at the Crossroads</article-title>
          ,
          <source>PVLDB</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>