<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ethics and Executability: Tracing Decency in Decentralised Knowledge Graph Applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aisling Third</string-name>
          <email>aisling.third@open.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Domingue</string-name>
          <email>john.domingue@open.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Knowledge Media Institute, The Open University</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Increasing interest in decentralisation for data and processing on the Web brings with it the need to re-examine methods for verifying data and behaviour for scalable multi-party interactions. We consider factors previously identified as relevant to verification of processing activity on knowledge graphs in a Trusted Decentralised Web, and use an implementation scenario to identify focused open questions.</p>
      </abstract>
      <kwd-group>
        <kwd>Trust</kwd>
        <kwd>Decentralisation</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Federated Querying</kwd>
        <kwd>Solid</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        There is increasing public awareness that personal data is valuable and vulnerable, and that data
about individual behaviour is routinely aggregated and exploited for commercial gain. It is widely
recognised that centralisation is a factor in enabling this exploitation; large providers such as social
networks integrate many services, of their own and from third parties, tying them to digital identities
which they provide and control, allowing them to serve as hubs for data collection and analysis. This
model requires that trust be placed in the centralising parties: trust that identity will be faithfully
represented, personal data will be kept securely and privately, and data processing will only be carried
out in line with stated policies. Generally, the only view an individual has on data processing is from
seeing the results for them of using a centralised service; it is not transparent whether their data was
copied or used for anything else. Technologies for a decentralised Web seek to provide standardised
alternatives to this model which return agency to individuals. Linked Data Platforms such as Solid [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
are designed to support these models. Third-party Web applications for these platforms would no longer
run against large databases of information aggregated across all users of a centralised site, but instead
query each user’s personal data store directly whenever needed.
      </p>
      <p>
        “Trust”, in general, is a broad term with multiple facets. One might trust a service if its outputs
repeatedly prove to be accurate or useful, or if it performs well in terms of reliability or responsiveness,
gives value for money, behaves ethically, and so on. We focus on trust that a service provider is
well-behaved with personal data in the sense of, e.g., being ethical, doing only what the data’s subject
permits, being transparent and truthful about what has been done, and protecting the user’s privacy in
both processing and transparency. Ethical behaviour which is reassuring to users about service activity
is what we call decent. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we argued for a set of factors relevant to Web applications demonstrating
decent data behaviour when it comes to federated querying of decentralised knowledge graphs, with a
combination of data verification technologies, machine-readable data usage policy validation, and
traces of data processing activities. Here, we very briefly summarise that paper, and step through a
scenario in which those factors could be taken into account. From this, we identify topics which need
to be explored to enable even a very simple scenario based on favourable assumptions to be handled.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Decency and Decentralisation</title>
      <p>We are interested in a landscape in which individuals’ personal data is kept in private
personally-controlled “pods” with common access and programming interfaces, prominently exemplified by the
        Solid Linked Data Platform infrastructure, and with the architecture described in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] of a decentralised
Web where Web applications operate on these pods, retrieving only relevant and permitted data from
them in order to function for that individual, only querying, with authorisation, for that data at the time
of need for processing. The motivation is that the individual is the source of truth and controller of their
own data rather than multiple potentially distinct providers in silos.
      </p>
      <p>Figure 1 shows two perspectives on this architecture: the user- and the app-centric views. In (a.), the
focus is on the user, and their individual selection of Web applications which they want to have some
access to their data. Because only one user is shown, this gives only part of the picture.
View (b.) takes the perspective of a single Web application and illustrates the likely scenario
of it having multiple users. A key element made visible by the perspective in (b.) is that pods are
personal knowledge graphs, and therefore that applications interacting with multiple pods are
performing a type of federated knowledge graph querying across the pods they are authorised to access.</p>
      <p>Figure 1: Decentralised Web applications based on personal data stores - (a.) user- and (b.) app-centric views.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we argued for the following set of technical considerations which could be applied in order
for application providers to guarantee well-behaved query processing to users, with the hypothesis that
such decent behaviour could be a selling point in a competitive Web ecosystem.
      </p>
      <p>1. Traces: records of activity, e.g., input, intermediate, and output data, to promote transparency.</p>
      <p>
        2. Policy validation: machine-readable constraints on data usage for different purposes, based on
arbitrary constraints. Relies on policy validation engines to test a particular use of data against
a policy. See, e.g., [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        3. Verification: timestamped tamper-evident data in shareable formats, including verifiable
selective disclosure (e.g., [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). Known-good verifiable code.
      </p>
      <p>4. Anonymity: interpersonal privacy respected both by platforms and by verifiable tracing &amp;
policy infrastructure - a trace and policy validation proof for one individual should not reveal
personal information of another.</p>
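      <p>To make the first and third considerations concrete, the following is a minimal sketch of a tamper-evident activity trace built as a hash chain. It is an illustration under stated assumptions rather than the mechanism of any particular platform: the record fields (time, action, actor), the function names, and the "genesis" starting value are all hypothetical.</p>
      <preformat>
```python
import hashlib
import json


def entry_digest(prev_digest: str, record: dict) -> str:
    """Hash a trace record together with the digest of the previous entry."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_digest.encode("utf-8") + payload).hexdigest()


def append(trace: list, record: dict) -> None:
    """Append a record, chaining it to the digest of the previous entry."""
    prev = trace[-1]["digest"] if trace else "genesis"
    trace.append({"record": record, "digest": entry_digest(prev, record)})


def verify(trace: list) -> bool:
    """Recompute every digest; an edited or reordered record breaks the chain."""
    prev = "genesis"
    for entry in trace:
        if entry["digest"] != entry_digest(prev, entry["record"]):
            return False
        prev = entry["digest"]
    return True


# Hypothetical records of application activity.
trace = []
append(trace, {"time": "2023-02-14T10:00:00Z", "action": "query", "actor": "app"})
append(trace, {"time": "2023-02-14T10:00:01Z", "action": "aggregate", "actor": "app"})
intact = verify(trace)

# Simulate tampering with a past record without recomputing the chain.
trace[0]["record"]["action"] = "copy"
tampered_detected = not verify(trace)
```
      </preformat>
      <p>A chain of this kind is only tamper-evident if its latest digest is anchored somewhere the trace's author cannot rewrite, for example by signing it or publishing it to a ledger; that anchoring step is outside this sketch.</p>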
      <p>
        As well as policy evaluation and verifiable selective disclosure, this concept builds on existing work
on provenance tracing and trusted execution. Recalling that we consider decency to include not just
being well-behaved, but also being reassuring to users, it is important to look at how good behaviour
may be implemented and how it can be communicated. In terms of federated querying of knowledge
graphs, the implementation of tracing depends on each individual query, as data gathered from different
sources is processed with various operations by a query engine to produce results. Operations which,
for example, aggregate data can lead to individual results originating from multiple data sources, which
potentially can be anonymising (if operations cannot be reversed algorithmically or by estimation), or
transparent (if they can) - in the latter case, sharing a full trace with the subject of one data source may
reveal information about the subject of another. Provenance tracing in the database querying literature
(e.g., [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) identifies subtypes of provenance: lineage, why-, and how-provenance. In brief, for a query
result, its lineage is the set of data sources contributing to it, its why-provenance is the set of paths in
the query plan/execution between the result and the sources in the lineage, and its how-provenance is
(an algebraic representation of) the why-provenance with the query operators that are applied at each
stage along the path. For knowledge graph querying, [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] presents a means of tracking provenance of
queries by means of query rewriting, giving how-provenance (and therefore why-provenance and
lineage) for non-aggregating queries, and lineage for aggregating queries - with the specific advantage
of the query rewriting approach being its compatibility with arbitrary query engines. This is particularly
relevant in the decentralised/federated context where different data sources may be accessed via
different engines. At the time of writing, we could not identify any research into privacy/anonymity
analysis of algebraic how-provenance for knowledge graph queries. For communicating provenance
information, vocabularies such as the Provenance Ontology<sup>2</sup> and VoID<sup>3</sup> are well-established for
describing activities and datasets; there is however a gap with regard to common vocabularies for
describing how-provenance and query plans.
      </p>
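      <p>The distinction between lineage and how-provenance can be sketched as follows. This is a simplified illustration in the style of algebraic provenance annotations, not the query rewriting method of the cited work: the class and function names, the pod identifiers, and the city value are hypothetical, and why-provenance is omitted for brevity.</p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotated:
    """A value with the sources it came from and an algebraic derivation."""
    value: object
    lineage: frozenset
    how: str


def source(value, source_id: str) -> Annotated:
    """Annotate a value read directly from one data source."""
    return Annotated(value, frozenset([source_id]), source_id)


def join(left: Annotated, right: Annotated, combine) -> Annotated:
    """Both inputs are needed, so the derivation is written as a product."""
    return Annotated(
        combine(left.value, right.value),
        left.lineage | right.lineage,
        f"({left.how} * {right.how})",
    )


def union(left: Annotated, right: Annotated) -> Annotated:
    """Either input yields the same value, so the derivation is a sum."""
    assert left.value == right.value
    return Annotated(
        left.value,
        left.lineage | right.lineage,
        f"({left.how} + {right.how})",
    )


# Hypothetical values from two pods.
alice_city = source("CityX", "pod_alice")
bob_city = source("CityX", "pod_bob")

# "Do the two users share a city?" needs both pods.
shared = join(alice_city, bob_city, lambda a, b: a == b)

# "Which cities occur?" is derivable from either pod alone.
either = union(alice_city, bob_city)
```
      </preformat>
      <p>Both results have the same lineage (the two pods), but their how-provenance differs: the product records that both sources were required, while the sum records alternative derivations. It is this extra detail that makes how-provenance more informative, and potentially more revealing, than lineage alone.</p>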
      <p>
        Trusted execution and technologies to support it cover a range of interpretations of trust: see, e.g.,
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for a thorough overview. Often, these technologies include dedicated hardware in some form, to
minimise the possibility of malicious or misbehaving software causing an unfounded sense of trust.
While commonly-used trusted execution hardware (e.g., contactless payment cards) might only support
the execution of certain cryptographic operations on secure data, others can secure applications or
virtual machines. As an alternative to dedicated hardware, blockchain-based smart contract solutions
(e.g., the Ethereum<sup>4</sup> Virtual Machine) support trusted execution of public code by means of
highly-distributed public computation. In general, the security provided by blockchain technologies relies on
multiple parties executing the same code on the same data, which decreases privacy and
performance unless carefully designed to avoid these risks. Recent work such as [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] explores trusted execution
for knowledge graphs using smart contracts.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.1. Scenario</title>
      <p>In order to understand the issues and difficulties, we walk through a scenario of an interaction that
takes the preceding considerations into account. For clarity, consider Figure 2, a scenario with only two
actors: a user, represented by their pod, and a Web application representing its provider. Both actors
have their own policies on the use of their data, and both have activity tracing. We look at the case of
the user seeking reassurance that the application is well-behaved with their data; in reality, of course,
this can work both ways. The application is seeking to query the user’s data, in line with their policies,
to produce output. We assume that this processing will also involve data from other users, and that the
output may be sent to further actors which may or may not include the user shown. We also, for the
sake of argument, make the significant simplifying assumptions that inputs and outputs are all Linked
Data, and that application providers are using open-source components for at least data transformation
and aggregation (if not, e.g., user interface). These are clearly highly unrealistic assumptions which
hide difficulties that actual application of these ideas would encounter; however, even with these
simplifications, there are many open questions to be addressed. Understanding the issues in the simple
scenario will hopefully then shed light on how to approach more complex real situations.</p>
      <p><sup>2</sup> https://www.w3.org/TR/prov-o/; <sup>3</sup> https://www.w3.org/TR/void/; <sup>4</sup> https://ethereum.org/en/</p>
      <p>Taking the proposal of our previous paper, we step through a
hypothetical data query and processing scenario and consider key points at
which our suggested use of activity tracing, policy evaluation, verification,
and considerations of anonymity come into play. The specific mechanisms
of verification are omitted: digital signatures, distributed ledgers, or other
technologies could play this role. For reasons of space, we also omit most
depictions of specific activity tracing: every action in this scenario is
logged and traced by the relevant actors, and traces are anchored as
tamper-evident. Where tracing is not depicted, it should be assumed that it has taken place.</p>
      <sec id="sec-3-1">
        <title>Step 1. Code Verification</title>
        <p>Beginning from a point at which the application and user have authenticated
with each other, the user consents in principle to the application querying
and processing some of their data. The user will rely on activity traces and
policy evaluation from the application to determine its behaviour, and therefore needs
to know that the code used for these purposes meets their
own policies and has not been modified to, e.g., create false
records of activity. As we have assumed components are
open source, the application can provide cryptographically-backed
proof that the particular query and policy validation
engines to be used (indicated in Figure 4 by the lock symbol)
correspond to known-good public versions which can be
subject to community scrutiny. If relevant, the user can do
the same. Questions: How can we ensure the application
uses the components it presents, and does not substitute or
wrap them with something else that misbehaves? Can approaches to trusted execution be practically
used here?</p>
        <p>Figure 3: Key.</p>
        <p>Figure 4: Code verification.</p>
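        <p>The check described in this step can be sketched as a comparison of a presented component against a published digest. This is a minimal sketch under assumptions: the registry of known-good digests, the component name, and the artifact bytes are all hypothetical, and a real registry would itself need to be signed or otherwise anchored.</p>
        <preformat>
```python
import hashlib
import hmac


def digest(artifact: bytes) -> str:
    """SHA-256 digest of a released component, as a hex string."""
    return hashlib.sha256(artifact).hexdigest()


def matches_known_good(artifact: bytes, component: str, registry: dict) -> bool:
    """Check a presented artifact against the published known-good digest."""
    expected = registry.get(component)
    if expected is None:
        return False  # unknown components are never accepted
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(digest(artifact), expected)


# Hypothetical public release and the community-scrutinised registry entry for it.
public_release = b"query-engine source, version 1.0"
registry = {"query-engine": digest(public_release)}

accepted = matches_known_good(public_release, "query-engine", registry)
rejected_modified = matches_known_good(public_release + b" patched", "query-engine", registry)
rejected_unknown = matches_known_good(public_release, "policy-engine", registry)
```
        </preformat>
        <p>A check of this kind only establishes that the presented bytes match a public version. It does not establish that those bytes are what the application actually runs, which is exactly the open question raised above.</p>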
      </sec>
      <sec id="sec-3-2">
        <title>Step 2. Policy-based access control</title>
        <p>The application’s verified query engine (represented
by the magnifying glass in Figure 5) sends a query to
the user’s pod via the user’s policy validation engine.
This engine reads the user’s data usage policy and their pod,
and validates that the query and its results comply.
The results are only returned to the application if this
validation succeeds. The data ({}) and a copy of the user’s
policy are then passed to the application by its query
engine. Questions: What policy formalisms are
suitable in terms of both expressiveness and
automated validation? What vocabularies and
granularities of the purposes of processing
activities are useful, and how is compliance with purpose to be evaluated?</p>
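        <p>The gatekeeping flow of this step can be sketched as below. This is a deliberately simplified illustration: the policy is reduced to sets of permitted purposes and predicates, which is an assumption made for brevity and not a proposed policy formalism, and all names and data values are hypothetical.</p>
        <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical usage policy: permitted purposes and predicates."""
    allowed_purposes: frozenset
    allowed_predicates: frozenset


@dataclass(frozen=True)
class Query:
    """Hypothetical query: a declared purpose and the predicates requested."""
    purpose: str
    predicates: frozenset


def validate(policy: Policy, query: Query) -> bool:
    """The query complies if its purpose and every predicate are permitted."""
    return (query.purpose in policy.allowed_purposes
            and query.predicates.issubset(policy.allowed_predicates))


def answer(pod: dict, policy: Policy, query: Query):
    """Return the data and a copy of the policy only if validation succeeds."""
    if not validate(policy, query):
        return None
    data = {p: pod[p] for p in query.predicates if p in pod}
    return data, policy


pod = {"foaf:name": "Alice", "schema:birthDate": "1990-01-01"}
policy = Policy(frozenset(["recommendation"]), frozenset(["foaf:name"]))

granted = answer(pod, policy, Query("recommendation", frozenset(["foaf:name"])))
denied_purpose = answer(pod, policy, Query("advertising", frozenset(["foaf:name"])))
denied_data = answer(pod, policy, Query("recommendation", frozenset(["schema:birthDate"])))
```
        </preformat>
        <p>The sketch validates the query before any data leaves the pod; validating the results themselves, as the step also requires, would need a richer policy language than set membership.</p>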
      </sec>
      <sec id="sec-3-3">
        <title>Step 3. Application processing</title>
        <p>Internal to the application, the user’s data is processed,
transformed, and aggregated with other users’ data to generate
outputs. The application’s verified policy validation engine has
access to the user’s policy, the (user’s contribution to) input
data, and the output data, as shown in Figure 6. It validates the
output with respect to the policy. Questions: What types of
criteria, expressed in policies, could
allow for automated validation of compliance on outputs, and
which metrics are useful for, e.g., anonymity? What are the minimal sets of information
which traces need to include for this? And what are the trade-offs and constraints with regard to user
and others’ privacy, and application confidentiality, which affect and limit what can be done here?</p>
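        <p>As one possible shape for such an output check, the sketch below tests aggregated outputs against a minimum number of distinct contributors. This threshold is an assumption chosen purely for illustration, not a metric proposed here, and the field names and values are hypothetical.</p>
        <preformat>
```python
def validate_output(outputs: list, policy: dict) -> bool:
    """Check every output involving the user against a contributor threshold."""
    for out in outputs:
        contributors = set(out["contributors"])
        if policy["user"] not in contributors:
            continue  # outputs without the user's data are out of scope
        if len(contributors) >= policy["min_contributors"]:
            continue  # enough distinct contributors to satisfy the policy
        return False
    return True


# Hypothetical policy: the user's data may only appear in outputs
# aggregated over at least three distinct contributors.
policy = {"user": "alice", "min_contributors": 3}

well_mixed = [{"value": 34.5, "contributors": ["alice", "bob", "carol"]}]
too_narrow = well_mixed + [{"value": 41.0, "contributors": ["alice"]}]
unrelated = [{"value": 12.0, "contributors": ["bob"]}]

ok = validate_output(well_mixed, policy)
violation = not validate_output(too_narrow, policy)
out_of_scope_ok = validate_output(unrelated, policy)
```
        </preformat>
        <p>Note that the check needs the list of contributors for each output, which is lineage information from the trace. Even this simple criterion therefore depends on what the traces record.</p>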
      </sec>
      <sec id="sec-3-4">
        <title>Step 4. Trace anonymisation</title>
        <p>The preceding step is the only step which involves data originating
from neither the user nor the application. Thus, while the
application’s activity traces of prior stages could potentially be
shared with the user, for the processing stage, this might no longer
be safe from a privacy perspective, as data might leak between users
via the traces. The application therefore needs to ensure that the
verifiable records of activity returned to one user cannot be used to
extract anything which can be identifiably linked to another, and
that anything which could enable that is sent only to the user it
identifies. Figure 7 illustrates this. Questions: This step may even
be unnecessary if verifiable automated policy evaluation can provide strong enough guarantees of
compliance. In the (not unlikely) event that it cannot, what are the possibilities and limitations in trying
to separate or selectively disclose trace contents?</p>
        <p>Figure 7: Trace anonymisation.</p>
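        <p>One naive way to separate trace contents per recipient is sketched below: entries about other users are replaced by salted hash commitments. This is an assumption-laden illustration, not a solution to the questions above: the entry fields and function name are hypothetical, and the sketch does not address inference from aggregated values.</p>
        <preformat>
```python
import hashlib
import json
import secrets


def redact_for(trace: list, recipient: str) -> list:
    """Build a view of the trace that is safe to send to one recipient."""
    view = []
    for entry in trace:
        if entry["subject"] == recipient:
            view.append(dict(entry))  # the recipient's own entries stay readable
            continue
        # A random salt prevents guessing low-entropy records from the digest.
        # A real system would store the salt so the commitment can be opened
        # later; it is discarded here for brevity.
        salt = secrets.token_bytes(16)
        payload = json.dumps(entry, sort_keys=True).encode("utf-8")
        commitment = hashlib.sha256(salt + payload).hexdigest()
        view.append({"subject": "redacted", "commitment": commitment})
    return view


# Hypothetical processing-stage trace involving two users.
trace = [
    {"subject": "alice", "action": "read", "field": "age"},
    {"subject": "bob", "action": "read", "field": "age"},
]
view = redact_for(trace, "alice")
```
        </preformat>
        <p>The view keeps the number and order of entries, so the recipient can still see that other processing took place without learning whose data it concerned. Whether that residual information is itself acceptable is part of the open question.</p>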
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Conclusion</title>
      <p>We have described a scenario in which we attempt to bring a previously-identified set of technical
factors to bear on the task of making decentralised Linked Data querying transparent and traceable.
Specifically, we considered key stages in such a process and identified relevant open questions to be
answered to enable even a very simple scenario based on favourable assumptions to be handled. As we
progress ongoing work into infrastructure to implement these ideas, we are seeking to answer these
questions, among others that will undoubtedly arise, with the goal of laying the groundwork to build on
for realistic scenarios, to improve transparency and control over data usage for everyone.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J</given-names>
            <surname>Cheney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L</given-names>
            <surname>Chiticariu</surname>
          </string-name>
          and
          <string-name>
            <given-names>W-C</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Provenance in Databases: Why, How, and Where</article-title>
          .
          <source>Foundations and Trends in Databases</source>
          <volume>1</volume>
          (
          <issue>4</issue>
          ), pp
          <fpage>379</fpage>
          -
          <lpage>474</lpage>
          . http://dx.doi.org/10.1561/1900000006.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V</given-names>
            <surname>Goretti</surname>
          </string-name>
          .
          <year>2022</year>
          .
          <article-title>Safe and controllable information consumption for data market applications</article-title>
          .
          <source>Masters thesis</source>
          , University of Rome La Sapienza, https://penni.wu.ac.at/supervision/Valerio%20Goretti%20Master%20Thesis%202022.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D</given-names>
            <surname>Hernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L</given-names>
            <surname>Galárraga</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K</given-names>
            <surname>Hose</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Computing how-provenance for SPARQL queries via query rewriting</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <issue>13</issue>
          ),
          <fpage>3389</fpage>
          -
          <lpage>3401</lpage>
          . https://doi.org/10.14778/3484224.3484235.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E</given-names>
            <surname>Mansour</surname>
          </string-name>
          , AV Sambra,
          <string-name>
            <given-names>S</given-names>
            <surname>Hawke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Zereba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S</given-names>
            <surname>Capadisli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A</given-names>
            <surname>Ghanem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A</given-names>
            <surname>Aboulnaga</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A Demonstration of the Solid Platform for Social Web Applications</article-title>
          .
          In
          <source>Proceedings of the 25th International Conference Companion on World Wide Web (WWW '16 Companion)</source>
          . Switzerland,
          <fpage>223</fpage>
          -
          <lpage>226</lpage>
          . https://doi.org/10.1145/2872518.2890529.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C</given-names>
            <surname>Shepherd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G</given-names>
            <surname>Arfaoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I</given-names>
            <surname>Gurulian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>RP</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <surname>K Markantonakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>RN</given-names>
            <surname>Akram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D</given-names>
            <surname>Sauveron</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E</given-names>
            <surname>Conchon</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Secure and Trusted Execution: Past, Present, and Future - A Critical Review in the Context of the Internet of Things and Cyber-Physical Systems</article-title>
          .
          <source>IEEE Trustcom/BigDataSE/ISPA</source>
          , Tianjin, China, 2016, pp.
          <fpage>168</fpage>
          -
          <lpage>177</lpage>
          . https://doi.org/10.1109/TrustCom.2016.0060.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S</given-names>
            <surname>Steyskal</surname>
          </string-name>
          and
          <string-name>
            <given-names>S</given-names>
            <surname>Kirrane</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>If you can't enforce it, contract it: Enforceability in Policy-Driven (Linked) Data Markets</article-title>
          .
          <source>International Conference on Semantic Systems</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A</given-names>
            <surname>Third</surname>
          </string-name>
          and
          <string-name>
            <given-names>J</given-names>
            <surname>Domingue</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>LinkChains: Trusted Personal Linked Data</article-title>
          . In: Blockchain-enabled Semantic Web,
          <source>International Semantic Web Conference</source>
          <year>2019</year>
          , Auckland, New Zealand.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A</given-names>
            <surname>Third</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J</given-names>
            <surname>Domingue</surname>
          </string-name>
          .
          <year>2023</year>
          .
          <article-title>Decency and Decentralisation: Verifiable Decentralised Knowledge Graph Querying</article-title>
          .
          <source>Trusting Decentralised Knowledge Graphs and Web Data at the Web Conference</source>
          . http://oro.open.ac.uk/88166/.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R</given-names>
            <surname>Verborgh</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Paradigm Shifts for the Decentralized Web</article-title>
          . Retrieved: 2023-02-14. https://ruben.verborgh.org/blog/2017/12/20/paradigm-shifts-for-the-decentralized-web/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>