<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Privacy-preserving Decentralized Learning of Knowledge Graph Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anh-Tu Hoang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ahmed Lekssays</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barbara Carminati</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Ferrari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi dell'Insubria</institution>
          ,
          <addr-line>Varese</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Knowledge Graphs (KGs) enhance the performance of machine learning applications, such as recommendation systems and drug discovery. This is achieved through vector representations of KGs' semantics, called Knowledge Graph Embeddings (KGEs). However, obtaining adequate data to train high-quality KGEs can be challenging for individual service providers. FedE and FedR address this challenge by enabling federated learning of KGEs without sharing local KGs, but they are limited by their reliance on trusted servers and lack of protection against inference attacks. Recently, FKGE has been proposed to enable collaboration between providers in the training of KGEs, exploiting differential privacy. Nevertheless, updating KGEs from all providers is time-consuming, and it does not protect against poisoning and backdoor attacks. Following this research direction, this paper focuses on the security and privacy requirements for decentralized learning of KGEs, presents a reference architecture to support these requirements, and discusses its security and privacy limitations.</p>
      </abstract>
      <kwd-group>
<kwd>distributed learning</kwd>
        <kwd>security</kwd>
<kwd>differential privacy</kwd>
        <kwd>knowledge graph embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Knowledge graphs (KGs) are attracting giant tech companies thanks to their ability to represent knowledge about both entities' attributes and their relationships. The combination of entities' attributes and relationships in KGs improves the quality of various AI applications, such as recommender systems (e.g., Amazon Product KGs) and drug discovery (e.g., AstraZeneca's Drug Discovery KGs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These applications rely on KGs' embeddings (KGEs), which are vector representations of entities in KGs, defined such that the semantics between the entities are preserved. However, learning embeddings requires combining a large number of KGs, and sharing the KGs directly can violate the privacy of the entities.
      </p>
      <p>
        Recently, FedE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has been designed to enable multiple data providers to jointly train the embeddings under federated learning (FL) settings. FedE utilizes a trusted server to collect the entities of the providers' KGs, aggregate the providers' embeddings, and distribute the aggregated embeddings to all providers. However, even though the local KGs are not shared, Zhang et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] showed that relying on trusted servers is impractical and introduced a new attack, i.e., a KG reconstruction attack, able to infer the local KGs by relying on the collusion between the server and one peer. To mitigate this attack, the authors in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed FedR. Instead of sharing the entities' embeddings, FedR shares the relations' ones. However, FedR fails to completely mitigate the attack. If a server is untrusted, it can perform the attack even without collusion with any peers. It can perform inference attacks on the shared relations' embeddings to infer entities' existence. Moreover, it can simply refuse to aggregate model updates, which makes the whole framework limited.
      </p>
      <p>
        On the other side, Differential Privacy (DP) was proposed by Dwork et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to ensure that statistics extracted from relational data do not reveal the existence of users, even if the attacker exploits any background knowledge. To that end, DP adds noises to the data so that the statistics extracted from the data remain the same whether the user is present or not. DP has been used to train machine learning and deep learning models in such a way that the presence of the users whose data are used to train the models is concealed [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ]. In particular, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed DP-SGD, which adds noises to the updated weights of the models trained on data (e.g., images, text) to ensure that the final weights satisfy DP. In this way, attackers cannot exploit the trained models to infer the users' existence. However, DP-SGD decreases the quality of the models when they are trained with a high number of epochs. To train a private model with improved quality, PATE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and PATE-GAN [7] have been proposed. In the context of KGs, recently, FKGE [8] adopted PATE-GAN [7] to ensure user privacy while allowing providers to train their entities' embeddings in federated learning settings. FKGE allows each provider to connect to another provider and improve their entities' embeddings. However, because each provider only connects to one provider at a time, connecting to and updating their embeddings from all providers will take a long time.
      </p>
      <p>
        To address these limitations, we propose a new privacy-preserving decentralized learning framework for knowledge graph embeddings. Our approach differs from FKGE [8] in that we enable a peer to enhance its embeddings using the embeddings of multiple peers simultaneously, whereas FKGE [8] restricts each peer to connect to only one peer at a time for embedding improvement. Furthermore, FedE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] relies on a trusted server to aggregate embeddings, whereas our work is fully decentralized. Because the training is done asynchronously, the proposed architecture (1) speeds up the embeddings' training and (2) improves the embeddings' quality, because they come from a variety of sources [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The proposed framework is based on IOTA, a permissionless distributed ledger, where users can submit the metadata of their trained embeddings and store the embeddings on IPFS, a distributed filesystem. When a user wants to train embeddings for the same task (e.g., link prediction, entity classification), he/she takes the most recent embeddings of the task and aggregates them with his/her own ones. In turn, the updated embeddings are then shared on the distributed ledger. We use DP to protect the user's privacy and prevent attackers from inferring data from the shared embeddings.
      </p>
      <p>This paper is organized as follows. Section 2 reviews related work. Section 3 illustrates the problem statement, the threat model, and the requirements of the proposed platform. In Section 4, we introduce the proposed architecture, whereas we discuss how the platform addresses the requirements in Section 5. We conclude this work in Section 6.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Related Work</title>
      <p>In this section, we review state-of-the-art techniques for knowledge graph embeddings, differential privacy, and federated learning.</p>
      <sec id="sec-1-1-1">
        <title>2.1. Knowledge Graph Embeddings</title>
        <p>A Knowledge Graph (KG) is composed of edges (aka triplets), each consisting of a head entity, a tail entity, and a relationship connecting them. For instance, a triplet (h, r, t) expresses that the head entity h is connected to the tail entity t through the relation r. Knowledge graph embeddings (KGEs) are low-dimensional vectors representing entities and relations in KGs such that the semantics between the entities and relations are preserved. The embeddings can be used as inputs in deep learning models (e.g., convolutional networks, fully-connected networks) to support various tasks, such as link prediction and entity classification. To preserve the semantics, the embeddings must be trained by using KGE models [9], whose score functions estimate the plausibility of KGs' edges. KGE models can be classified into translational distance, semantic matching, and neural network models.</p>
        <p>Translational distance models (e.g., TransE [9]) measure the plausibility of an edge as the distance between its entities' embeddings. TransE [9] first represents entities and relations as vectors with equal dimensions. Then, it estimates the plausibility of an edge through a scoring function that measures the distance between the embeddings of the edge's head and tail entities, which are connected by the embedding of the edge's relation. By minimizing this function, the embedding of the tail entity is pushed close to the embedding of the head entity plus the relation's embedding. TransE has been extended to TransD, TransH, and TransR [9] to deal with 1-N, N-1, and N-N relations.</p>
        <p>Semantic matching models (e.g., RESCAL and DistMult [9]) exploit a similarity-based distance to measure the plausibility of an edge by matching its embeddings' dimensions. RESCAL [9] associates each entity with a vector, while each relation is represented by a matrix. The score function measures the plausibility of an edge by using the bilinear product of its entities' embeddings and its relation's matrix. DistMult [9] simplifies RESCAL by using diagonal matrices.</p>
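        <p>To make the two score functions concrete, the following is a minimal sketch (our own illustration, not the implementation of [9]) of how TransE and DistMult score a triplet, assuming the embeddings are NumPy vectors and, for DistMult, the relation's diagonal matrix is stored as a vector:</p>
        <preformat>
import numpy as np

def transe_score(h, r, t, norm=1):
    # TransE: plausible edges satisfy h + r ≈ t, so the score is the
    # (negative) distance between h + r and t.
    return -np.linalg.norm(h + r - t, ord=norm)

def distmult_score(h, r_diag, t):
    # DistMult: bilinear score h^T diag(r) t, a relation-weighted
    # similarity between the head and tail embeddings.
    return float(np.sum(h * r_diag * t))

# Toy usage with 4-dimensional embeddings.
h, r, t = np.random.randn(3, 4)
print(transe_score(h, r, t), distmult_score(h, r, t))
</preformat>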
      <sec id="sec-1-2">
        <title>Diferential privacy (DP) [ 4] has been presented to extract</title>
        <p>A Knowledge Graph (KG) is composed of edges (aka information from a dataset while hiding the existence of
triplets) consisting of a head entity, a tail entity, its entities. This is done by adding noises to the
informaand a relationship connecting them. For instance, tion before sharing it. The amount of noise required to be
(, , ) is a triplet expressing that ’s added depends on privacy parameters (e.g.,  ,  ), and the
 is . Knowledge graph embeddings (KGEs) sensitivity of the extraction function. Here,  illustrates
are low-dimensional vectors representing entities and how similar the noisy information and the original ones
relations in knowledge graphs (KGs) such that the se- are, whereas  is the probability that DP fails to protect
mantics between the entities and relations are preserved. entities’ privacy. The sensitivity of a function is the
maxThe embeddings can be used as inputs in deep learning imum change of the function’s outputs when removing
a single entity from the dataset. issues by creating a distributed learning platform using</p>
        <p>
          DP-SGD [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] was developed to avoid making assump- PATE-GAN [7]. In particular, each provider sequentially
tions about the presence of entities in KG, whose data connects to another one and uses PATE-GAN [7] to train
is used to train deep learning models. To this end, it its KG’s embeddings. If the trained embeddings have
adds noise to the models at every epoch. However, the higher quality than the old ones, the provider updates
higher the number of epochs, the more noises are added. its embeddings. However, since FKGE only connects to a
Since the noises reduce the quality of the trained mod- provider at a time, it takes a long time for all providers
els (e.g., classification accuracy), DP-SGD generates low- to train their KGEs.
quality models when training with a high number of In this work, we develop a platform that allows a
epochs. PATE [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] addresses this issue by training mod- provider to use embeddings of many providers at the
els from privacy-aware datasets, that is, whose data are same time. Therefore, we can improve the time required
non-sensitive and whose labels are generated by DP. In for all providers to train their KGEs while applying
PATEparticular, it creates many models, called teacher models, GAN [7] to protect entities’ privacy.
each of which is trained on a non-overlapped part of the
original datasets. The teacher models are then used to
generate labels for public datasets. Finally, PATE uses 3. Privacy-preserving
the public datasets and their generated labels to train the decentralized learning of
ifnal models. The higher the size of the public datasets, knowledge graph embeddings
the more noise PATE adds. PATE-GAN [7] has been
introduced to reduce the amount of added noises by using In this section, we first introduce the problem statement.
GAN to learn the labels. In this paper, we apply PATE and Then, we explain the privacy and security requirements.
PATE-GAN to generate privacy-aware KGs’ embeddings.
        </p>
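        <p>To illustrate how the privacy parameters and the sensitivity interact, the following sketch (a textbook illustration under our own assumptions, not part of the platform) adds Laplace noise calibrated to a query's sensitivity to achieve ε-DP:</p>
        <preformat>
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    # Classic ε-DP Laplace mechanism: the noise scale grows with the
    # sensitivity and shrinks as the privacy budget ε grows.
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Counting queries have sensitivity 1: removing one entity changes the
# count by at most 1. A smaller ε yields a noisier answer.
noisy_count = laplace_mechanism(true_value=42, sensitivity=1.0, epsilon=0.5)
print(noisy_count)
</preformat>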
        <sec id="sec-1-2-1">
          <title>2.3. Federated Learning</title>
        <p>
          Federated learning allows many providers to collaboratively train their models without trusted centralized servers [11]. Its great success on image datasets led to applications of federated learning in training KGEs. FedE [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] is the first federated learning technique allowing providers to train KGEs for link prediction tasks without sharing their local KGs. The training process starts with a trusted server finding the set containing all entities from all of the providers. Then, the server initializes a random embedding for each entity. In each iteration, the server sends the current version of its embeddings to the providers. Each provider keeps updating the received entity embeddings with its local KGs. When the training is finished, all providers send the updated local embeddings to the server. The server aggregates all of the local entity embeddings to create a new version of the embeddings and continues with the next iteration. The training process stops after a fixed number of iterations. Although the providers do not share their KGs, attackers can exploit membership attacks on the entity embeddings sent from the providers to infer the existence of the entities/triplets used to train the embeddings. Moreover, the server can compromise users' privacy since it knows which local KGs contain users' data. If the server colludes with a client, it can also reconstruct the local KGs [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. FedR [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] has been introduced to protect entities' privacy by sharing not the entities' identities and embeddings but the relations' ones. However, although the identities are not shared, the providers can still violate entities' privacy by using the shared relations' embeddings. FKGE [8] remedies these issues by creating a distributed learning platform using PATE-GAN [7]. In particular, each provider sequentially connects to another one and uses PATE-GAN [7] to train its KG's embeddings. If the trained embeddings have higher quality than the old ones, the provider updates its embeddings. However, since FKGE only connects to one provider at a time, it takes a long time for all providers to train their KGEs.
        </p>
        <p>In this work, we develop a platform that allows a provider to use the embeddings of many providers at the same time. Therefore, we can improve the time required for all providers to train their KGEs, while applying PATE-GAN [7] to protect entities' privacy.</p>
      </sec>
    </sec>
    <sec id="sec-1-3">
      <title>3. Privacy-preserving decentralized learning of knowledge graph embeddings</title>
      <p>In this section, we first introduce the problem statement. Then, we explain the privacy and security requirements.</p>
        <sec id="sec-1-3-1">
          <title>3.1. Problem Statement</title>
          <p>Given a set of providers P, each provider p ∈ P holds a knowledge graph KG_p, formally defined as (N_p, R_p, T_p), where N_p, R_p, and T_p are the sets of nodes (i.e., entities and attribute values), relations, and triplets of KG_p, respectively. Each triplet (h, r, t) ∈ T_p is an edge of KG_p illustrating the relation between nodes in N_p, where h, t ∈ N_p and r ∈ R_p. Each provider p wants to train the embeddings of its entities (denoted as ℰ_p) and relations (ℛ_p) for a specific task (e.g., link prediction, node classification) without sharing its KG KG_p.</p>
          <p>To this end, the provider p first initializes its entities' and relations' embeddings. Then, at time t, it collects the embeddings shared by the other providers P⁻ ⊆ P ∖ {p} that train their embeddings for the same task. Let E⁻ be the set of embeddings collected up to time t. The provider aggregates the embeddings in E⁻ with its own embeddings at time t − 1 (ℰ_p^{t−1}, ℛ_p^{t−1}) to obtain the initial embeddings at time t. Then, it trains ℰ_p^t, ℛ_p^t with its local KGs and evaluates the quality of the trained embeddings. If the quality is improved, it shares ℰ_p^t, ℛ_p^t and continues selecting other providers' embeddings until it cannot improve its embeddings anymore. In Section 4, we present our platform's architecture supporting this scenario.</p>
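          <p>The following runnable sketch summarizes one provider's loop in this scenario; the helper stubs (train_dp, quality) and the plain averaging are our own simplifying assumptions, not a normative specification of the platform:</p>
          <preformat>
import random

def train_dp(emb, kg):
    # Stub for DP training (e.g., DP-SGD or PATE-GAN); returns the new
    # embeddings plus the (ε, δ) spent by this round.
    return [e + random.gauss(0, 0.01) for e in emb], 0.1, 1e-6

def quality(emb, kg):
    # Stub for a task-specific validation score (e.g., link prediction).
    return -sum(abs(e) for e in emb)

def provider_loop(local_kg, shared_sets, eps_max, delta_max):
    emb = [0.0] * 8                       # initialized embeddings
    eps_t = delta_t = 0.0                 # privacy budget spent so far
    for shared in shared_sets:            # embeddings collected at time t
        candidate = [sum(v) / (len(shared) + 1)   # aggregate with own ones
                     for v in zip(emb, *shared)]
        candidate, eps, delta = train_dp(candidate, local_kg)
        if eps_t + eps > eps_max or delta_t + delta > delta_max:
            break                         # budget would be exceeded: stop
        eps_t, delta_t = eps_t + eps, delta_t + delta
        if quality(candidate, local_kg) > quality(emb, local_kg):
            emb = candidate               # improved: adopt and share
    return emb, eps_t, delta_t
</preformat>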
        </sec>
        <sec id="sec-1-3-2">
          <title>3.2. Privacy and security requirements</title>
          <p>Federated learning and decentralized learning paradigms were proposed as a first step towards preserving the privacy of users' data, since the training is done locally and the users share only the weights of the models (in this paper, we use models' weights and embeddings interchangeably). Such paradigms, however, have been shown to be vulnerable to inference attacks [12]. In addition, models trained in federated/decentralized settings are vulnerable to other security attacks (e.g., poisoning attacks), where attackers try to manipulate the models to predict specific classes wrongly or to lower their performance on all the classes [13]. Federated/decentralized learning of KGEs shares the same issues. In this section, we analyze the privacy and security of decentralized learning of KG embeddings under the following threat model. We start with the assumptions on providers and attackers.</p>
          <p>Providers' Assumptions. We assume that the majority of providers in the system are honest and that, given a specific model, the majority of actors training that model are honest and have the same training objective.</p>
          <p>Attacker's Goal. For security, the attacker intends to manipulate the updates of the trained embeddings by injecting malicious updates into the system. These updates can be random or crafted to manipulate the prediction of the model using the trained embeddings. For the crafted updates, we focus on two well-known attacks in decentralized learning, namely label-flipping and backdoor attacks [13]. In a label-flipping attack, the attacker flips the labels of the local training samples from one source class to another target class, while keeping the other classes unchanged. Backdoor attacks involve the attacker embedding special patterns into the original training samples, such as patches of pixels, and changing their labels to the target label. The patterns are a trigger for the target class [16]. For the privacy attacks, we consider inference attacks [12]. Here, we assume providers can access the shared embeddings, which are trained on local KGs. Thus, the privacy attacks allow the providers to infer the existence of triplets used to train the embeddings.</p>
          <p>Attacker's Capabilities. The attackers can individually or collaboratively perform the above-mentioned attacks. We assume that each malicious actor (i.e., provider) can manipulate his/her own training data (i.e., KGs), but he/she cannot access or manipulate other actors' data.</p>
          <p>The trained embeddings in such paradigms are vulnerable to several security attacks (e.g., poisoning attacks, both for data and models, backdoor attacks, etc.) [14]. These attacks can be performed in a coordinated way, which brings up the need to mitigate collusion attacks as well [15]. In other words, the data providers could collude with each other. For an attack to succeed, an attacker must have more influence on a target class than the total influence of the honest clients on that class. So, an attacker might try to collude with other attackers sharing the same objective of manipulating the prediction for a target class. The proposed defense is expected to be robust against colluding attacks as well.</p>
          <p>To perform the inference attacks, a provider must obtain the shared entities' and relations' embeddings. Then, the provider can try to analyze the embeddings with the background knowledge it has. The proposed defense must be invulnerable to this analysis without any assumption on the knowledge the attacker uses.</p>
        </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Architecture</title>
      <p>In this section, we present the general architecture in support of privacy-preserving decentralized learning of KGEs. It relies on a public distributed ledger technology called IOTA [17] to store the embeddings' metadata and on a decentralized filesystem called IPFS [18] to store the embeddings' weights. Figure 1 shows the overall architecture of the system.</p>
      <p>The proposed system is a fully decentralized learning framework where users with the same training objective can train embeddings collaboratively. In addition, it supports the training of multiple embeddings asynchronously. Training a KGE model begins with submitting a transaction (i.e., a message) to IOTA containing the model metadata, such as the model identifier, the path to the model weights in IPFS, and the parent model updates that were aggregated from (this field is empty for the initial transaction of a model, which is called a genesis transaction). As a result, each model update refers to k other model updates that came before it. In other words, each model update aggregates model updates that were submitted before. This aggregation is done only if the old model updates improve the accuracy of the user's local model. Hence, typically, the latest model updates are better versions of the same model. So, each model will be represented as a directed acyclic graph, as shown in Figure 1. It is worth noting that the parameter k is chosen by the deployer of the system.</p>
      <p>We use DP to prevent privacy attacks on the shared embeddings. To prevent DP from adding too much noise, we allow each provider to specify an upper-bound threshold on its privacy budgets (i.e., on its ε and δ).</p>
        <p>
          When training the updated embeddings with differential privacy techniques (e.g., DP-SGD [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], PATE-GAN [7]), a provider can estimate the current privacy costs by using sequential and parallel composition. In particular, let ε_t and δ_t be the privacy budgets spent by a provider up to time t. By training with DP-SGD [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], the provider increases ε_t and δ_t by its privacy parameters (i.e., ε and δ) at every epoch. When training with PATE-GAN [7], the provider increases ε_t and δ_t according to the number of triplets of its public KG. Here, DP-SGD and PATE-GAN generate noises such that the existence of any triplet in the local KG is hidden. Since the privacy budgets can be calculated before training, the provider can easily estimate whether they would exceed the upper-bound threshold. If this is the case, the provider stops the training. Otherwise, it starts training the new embeddings. After the training process is finished, it shares the embeddings and the privacy budgets used to train them.
        </p>
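        <p>A minimal sketch of this pre-training budget check follows (sequential composition over epochs; the names and the per-epoch accounting are our own simplifying assumptions):</p>
        <preformat>
def can_train(eps_spent, delta_spent, eps_per_epoch, delta_per_epoch,
              epochs, eps_max, delta_max):
    # Sequential composition: budgets add up across the planned epochs.
    eps_after = eps_spent + epochs * eps_per_epoch
    delta_after = delta_spent + epochs * delta_per_epoch
    return eps_after &lt;= eps_max and delta_after &lt;= delta_max

# The provider checks the budget before training and stops otherwise.
assert can_train(0.5, 1e-6, 0.1, 1e-7, epochs=5, eps_max=2.0, delta_max=1e-5)
</preformat>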
      <p>Honest providers will report their embeddings and the privacy budgets used to train them. Malicious providers, instead, might aim to disturb the learning phase by setting a low ε and a high δ. Thus, DP adds more noise to the embeddings, and the probability that it cannot prevent privacy attacks also increases (Section 2.2). However, in this case, the malicious providers' embeddings will be noisy and will not be used for training by honest clients.</p>
    </sec>
    <sec id="sec-3">
      <title>5. Discussion</title>
      <sec id="sec-3-1">
        <title>In this section, we show how the proposed platform addresses security and privacy requirements presented in Section 3. In addition, we discuss the scalability of our framework based on the architecture shown in Figure 1.</title>
        <sec id="sec-3-1-1">
          <title>5.1. Security</title>
          <p>The proposed framework ensures transparent, multi-model, decentralized learning. However, training models without validating the shared weights would allow attackers to manipulate the models' predictions. Model updates might be malicious, as they could be crafted to initiate (individually or collaboratively) the poisoning and backdoor attacks discussed in Section 3.2. In this section, we discuss possible mitigations of these attacks. Since we assume that the majority is honest, honest actors can check the validity of the model updates by considering the actor's local model as a reference (i.e., a global model) and computing the Euclidean distance or cosine similarity to exclude adversarial updates [19, 13]. These techniques could be extended to mitigate such attacks when they are performed in a collaborative manner (i.e., Sybil attacks) [20, 13]. This line of defense is based on the observation that malicious actors have a similar objective. Hence, when computing their cosine similarities, the angle between their model updates will be small and the cosine similarity will be higher. However, such defenses fail when the malicious actors submit random weights (i.e., untargeted attacks); in other words, when the malicious actors do not have any common objective. Thus, there is a need to combine such defenses with other defenses, such as KRUM, Multi-Krum, and Trimmed Mean [21], that are able to discard such updates. It is worth noting that such defenses can discard model updates coming from honest clients with strict privacy budgets, since their updates could be considered random, leading to untargeted attacks. Other defenses based on Trusted Execution Environments [22] or on manipulating the models to reduce their sizes [23] have been proposed. However, they come with major utility limitations and cannot be extended to a decentralized setting. The discussed defenses are designed to work in centralized environments, where a central aggregation server runs them. So, there is a need to replace the centralized component when validating model updates. This limitation could be addressed by forming a model-specific, randomly and periodically elected committee that reviews a batch of model updates using the aforementioned defenses. It is worth noting that such committees, often called dynamic committees, are used in Proof-of-Stake blockchains such as Algorand [24].</p>
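          <p>As an illustration of this line of defense, the following sketch (our own simplified assumption, not the platform's implementation) flags updates whose pairwise cosine similarity is suspiciously high, since colluding updates tend to point in a similar direction:</p>
          <preformat>
import numpy as np

def flag_similar_updates(updates, threshold=0.9):
    # updates: list of flattened model-update vectors.
    # Returns the indices of updates that are near-duplicates of another
    # update, a simple signal of coordinated (Sybil) behavior.
    flagged = set()
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            a, b = updates[i], updates[j]
            cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
            if cos > threshold:
                flagged.update((i, j))
    return flagged
</preformat>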
          <p>Since some honest updates with relatively small privacy budgets could be discarded, selecting the threshold for discarding model updates is crucial to maintaining good training performance in decentralized learning. However, the combination of security techniques with privacy requirements, and their possible side effects, remains future work.</p>
        </sec>
        <sec id="sec-4-1">
          <title>5.2. Privacy</title>
        <p>
          The proposed framework prevents the inference attacks (Section 3.2) by only sharing privacy-aware embeddings that are trained with DP techniques (e.g., DP-SGD [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], PATE-GAN [7]). Here, providers add noises such that their trained embeddings are similar whether or not a triplet is used to train them. Therefore, according to the DP sequential and parallel composition theorems [
          <xref ref-type="bibr" rid="ref5 ref7">5, 7</xref>
          ], the existence of any triplet used to train the embeddings is hidden, no matter what background knowledge dishonest providers use.
        </p>
        <sec id="sec-4-1-1">
          <title>5.3. Scalability</title>
          <p>The scalability of our framework is tied to the way we handle model updates. Since each node keeps a directed acyclic graph for each model, adding nodes (i.e., model updates) to the graph is not a heavy task, since only the metadata is appended to the nodes. Our framework relies on IPFS for storing the raw model updates and on IOTA for storing the metadata. Hence, a node does not bear the burden of storage; rather, it connects through HTTP to both IPFS and IOTA, and clients download only the model updates that they are interested in.</p>
        </sec>
      </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work has received funding from the Marie Skłodowska-Curie Innovative Training Network Real-time Analytics for Internet of Sports (RAIS), supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 813162. Additionally, it has been partially supported by CONCORDIA, the Cybersecurity Competence Network supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 830927. The content of this paper reflects the views only of the author(s). The European Commission/Research Executive Agency are not responsible for any use that may be made of the information it contains.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <sec id="sec-5-1">
        <title>In this paper, we highlight the security and privacy re</title>
        <p>quirements of decentralized knowledge graph
representation and propose an architecture for training such models.</p>
      <p>We presented the different privacy-preservation techniques used in knowledge graphs and the security defenses that mitigate various types of attacks in machine learning, such as poisoning attacks and backdoor attacks, when performed individually or collaboratively (i.e., Sybil attacks). We discussed the limitations caused by the interplay between security and privacy. The solutions to, and analysis of, this interplay remain future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gogleva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Polychronopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pfeifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Poroshin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ughetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Thorpe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bornot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sidders</surname>
          </string-name>
          , et al.,
          <article-title>Knowledge graph-based recommendation framework identifies drivers of resistance in egfr mutant non-small cell lung cancer</article-title>
          ,
          <source>Nature communications 13</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>FedE: embedding knowledge graphs in federated setting</article-title>
          ,
          <source>in: The 10th International Joint Conference on Knowledge Graphs, IJCKG'21</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2021</year>
          , pp.
          <fpage>80</fpage>
          -
          <lpage>88</lpage>
          . URL: https://doi.org/10.1145/3502223.3502233. doi:10.1145/3502223.3502233.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Eficient federated learning on knowledge graphs via privacy-preserving relation embedding aggregation</article-title>
          ,
          <source>CoRR abs/2203.09553</source>
          (
          <year>2022</year>
          ). URL: https://doi.org/10.48550/arXiv.2203.09553. doi:10.48550/arXiv.2203.09553. arXiv:2203.09553.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <article-title>Differential privacy</article-title>
          , in: M.
          <string-name>
            <surname>Bugliesi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Preneel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Sassone</surname>
          </string-name>
          , I. Wegener (Eds.),
          <source>Automata, Languages and Programming</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2006</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chu</surname>
          </string-name>
          , I. Goodfellow, H. B.
          <string-name>
            <surname>McMahan</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Mironov</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Talwar</surname>
            ,
            <given-names>L. Zhang,</given-names>
          </string-name>
          <article-title>Deep learning with differential privacy</article-title>
          ,
          <source>in: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security</source>
          , CCS '16,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2016</year>
          , p.
          <fpage>308</fpage>
          -
          <lpage>318</lpage>
          . URL: https://doi.org/10.1145/2976749.2978318. doi:10.1145/2976749.2978318.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Papernot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abadi</surname>
          </string-name>
          , Ú. Erlingsson, I. Goodfellow,
          <string-name>
            <given-names>K.</given-names>
            <surname>Talwar</surname>
          </string-name>
          ,
          <article-title>Semi-supervised knowledge transfer for deep learning from private training data</article-title>
          , in: International Conference on Learning Representations, 2017. URL: https://openreview.net/forum?id=HkwoSDPgg.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] J. Jordon, J. Yoon, M. van der Schaar, PATE-GAN: generating synthetic data with differential privacy guarantees, in: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net, 2019. URL: https://openreview.net/forum?id=S1zk9iRqF7.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] H. Peng, H. Li, Y. Song, V. Zheng, J. Li, Differentially private federated knowledge graphs embedding, in: Proceedings of the 30th ACM International Conference on Information and Knowledge Management, CIKM '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 1416-1425. URL: https://doi.org/10.1145/3459637.3482252. doi:10.1145/3459637.3482252.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Y. Dai, S. Wang, N. N. Xiong, W. Guo, A survey on knowledge graph embedding: Approaches, applications and benchmarks, Electronics 9 (2020) 750.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, M. Welling, Modeling relational data with graph convolutional networks, in: A. Gangemi, R. Navigli, M.-E. Vidal, P. Hitzler, R. Troncy, L. Hollink, A. Tordai, M. Alam (Eds.), The Semantic Web, Springer International Publishing, Cham, 2018, pp. 593-607. doi:10.1007/978-3-319-93417-4_38.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A. y Arcas, Communication-efficient learning of deep networks from decentralized data, in: A. Singh, X. J. Zhu (Eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume 54 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 1273-1282. URL: http://proceedings.mlr.press/v54/mcmahan17a.html.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] S. Truex, N. Baracaldo, A. Anwar, T. Steinke, H. Ludwig, R. Zhang, Y. Zhou, A hybrid approach to privacy-preserving federated learning, in: Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, 2019, pp. 1-11.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] S. Awan, B. Luo, F. Li, Contra: Defending against poisoning attacks in federated learning, in: Computer Security - ESORICS 2021: 26th European Symposium on Research in Computer Security, Darmstadt, Germany, October 4-8, 2021, Proceedings, Part I 26, Springer, 2021, pp. 455-475.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] V. Mothukuri, R. M. Parizi, S. Pouriyeh, Y. Huang, A. Dehghantanha, G. Srivastava, A survey on security and privacy of federated learning, Future Generation Computer Systems 115 (2021) 619-640.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] C. Fung, C. J. Yoon, I. Beschastnikh, The limitations of federated learning in sybil settings, in: RAID, 2020, pp. 301-316.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Y. Liu, S. Ma, Y. Aafer, W.-C. Lee, J. Zhai, W. Wang, X. Zhang, Trojaning attack on neural networks (2017).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] S. Popov, H. Moog, D. Camargo, A. Capossele, V. Dimitrov, A. Gal, A. Greve, B. Kusmierz, S. Mueller, A. Penzkofer, et al., The coordicide, Accessed Jan (2020) 1-30.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] J. Benet, IPFS - content addressed, versioned, P2P file system, arXiv preprint arXiv:1407.3561 (2014).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] D. Cao, S. Chang, Z. Lin, G. Liu, D. Sun, Understanding distributed poisoning attack in federated learning, in: 2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS), IEEE, 2019, pp. 233-239.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] C. Fung, C. J. Yoon, I. Beschastnikh, Mitigating sybils in federated learning poisoning, arXiv preprint arXiv:1808.04866 (2018).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] P. Blanchard, E. M. El Mhamdi, R. Guerraoui, J. Stainer, Machine learning with adversaries: Byzantine tolerant gradient descent, Advances in Neural Information Processing Systems 30 (2017).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] F. Mo, H. Haddadi, Efficient and private federated learning using TEE, in: Proc. EuroSys Conf., Dresden, Germany, 2019.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] Y. Jiang, S. Wang, V. Valls, B. J. Ko, W.-H. Lee, K. K. Leung, L. Tassiulas, Model pruning enables efficient federated learning on edge devices, IEEE Transactions on Neural Networks and Learning Systems (2022).</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] J. Chen, S. Gorbunov, S. Micali, G. Vlachos, Algorand agreement: Super fast and partition resilient byzantine agreement, Cryptology ePrint Archive (2018).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>