<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SPIRIT: A Semantic Transparency and Compliance Stack</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Patrick Westphal</string-name>
          <email>patrick.westphal@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Javier D. Fernandez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabrina Kirrane</string-name>
<email>sabrina.kirrane@wu.ac.at</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Lehmann</string-name>
          <email>jens.lehmann@iais.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Complexity Science Hub Vienna</institution>
          ,
          <addr-line>AT</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Enterprise Information Systems</institution>
          ,
          <addr-line>Fraunhofer IAIS, DE</addr-line>
          ,
<country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute for Applied Informatics (InfAI), University of Leipzig</institution>
          ,
          <addr-line>DE</addr-line>
          ,
<country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Vienna University of Economics and Business</institution>
          ,
          <addr-line>AT</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The European General Data Protection Regulation (GDPR) sets new precedents for the processing of personal data. In this paper, we propose an architecture that provides an automated means to enable transparency with respect to personal data processing and sharing transactions and compliance checking with respect to data subject usage policies and GDPR legislative obligations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>The SANSA Stack</title>
<p>The current “Big Data landscape” provides a plethora of tools and frameworks covering a variety of methods and techniques for processing huge amounts of data via a distributed cluster of machines. However, none of the general-purpose Big Data processing frameworks provides built-in support for processing big semantic data, e.g. to load and store RDF data, which, as a uniform data format, supports dealing with the heterogeneity of Big Data. This gap is tackled by the Semantic Analytics Stack (SANSA)5 [1], an open-source semantic data processing framework built on top of Apache Spark6 and Apache Flink7. SANSA provides a stack of functional layers ranging from RDF/OWL data representation to machine learning algorithms working on semantic data.</p>
<p>The Knowledge Distribution and Representation layer provides a means to read and write RDF and OWL files. In terms of data structures and programming interfaces, SANSA follows the common and accepted representations of Apache Jena8 and the OWL API9. Hence, the RDF and OWL data is provided as distributed collections of Apache Jena triples and OWL API axioms, respectively. On top of this, the Query layer comprises functionality for searching, exploring and extracting information from big semantic data through the SPARQL query language. SANSA supports executing SPARQL queries within an Apache Spark/Flink program, or via an HTTP SPARQL endpoint. In both cases the actual queries are translated into lower-level Apache Spark/Flink data processing instructions and executed on the Knowledge Distribution and Representation layer.</p>
      <p>The next layer in the SANSA Stack is the Inference layer, which builds on the layers mentioned so far. Besides actual data-level assertions, the Semantic Web technology stack also provides a means to express schema or ontological knowledge. Parts of the inherent semantics of the respective W3C standards, RDFS and OWL, may be encoded as rules which can be applied to infer new knowledge. With this forward chaining process all rule-based inferences may be materialized. In contrast, backward chaining techniques infer new knowledge starting at a given `goal', which can be a (set of) RDF triple(s). SANSA supports different existing reasoning profiles for rule-based forward/backward chaining. Apart from these profiles, SANSA is able to compute an efficient execution plan for arbitrary sets of rules. Hence, users can adjust the trade-off between expressivity and performance, and furthermore introduce custom rules, e.g. to represent business policies. On top of the SANSA Stack, the Machine Learning layer provides a collection of machine learning algorithms that can work directly on RDF triples or OWL axioms. The algorithms implemented thus far cover knowledge graph embeddings [2] (e.g. for link prediction), graph clustering and association rule mining techniques.</p>
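      <p>The rule-based forward chaining materialization described above can be sketched in a few lines. The following is a minimal, single-machine illustration (in Python, with a hypothetical tuple representation of triples) of applying one RDFS entailment rule to a fixed point; SANSA itself runs such rules as distributed Spark/Flink jobs over partitioned triple collections.</p>
      <preformat>
```python
# Minimal forward-chaining sketch (illustrative only). Triples are
# (subject, predicate, object) tuples; the single rule applied is the
# RDFS type-propagation entailment:
#   (?x rdf:type C1) + (C1 rdfs:subClassOf C2)  =>  (?x rdf:type C2)

def materialize(triples):
    """Apply the RDFS type-propagation rule until a fixed point is reached."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = [(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"]
        types = [(s, o) for s, p, o in inferred if p == "rdf:type"]
        for inst, cls in types:
            for sub, sup in subclass:
                if cls == sub and (inst, "rdf:type", sup) not in inferred:
                    inferred.add((inst, "rdf:type", sup))
                    changed = True
    return inferred

triples = {
    ("ex:alice", "rdf:type", "ex:Employee"),
    ("ex:Employee", "rdfs:subClassOf", "ex:Person"),
}
print(("ex:alice", "rdf:type", "ex:Person") in materialize(triples))  # True
```
      </preformat>
      <p>Backward chaining, by contrast, would start from the goal triple (ex:alice, rdf:type, ex:Person) and apply the same rule in reverse, touching only the triples relevant to that goal.</p>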
      <p>Footnotes: 5 SANSA Stack home page, http://sansa-stack.net; 6 Apache Spark, https://spark.apache.org; 7 Apache Flink, https://flink.apache.org; 8 Apache Jena, http://jena.apache.org/; 9 OWL API, https://owlcs.github.io/owlapi/</p>
    </sec>
    <sec id="sec-3">
      <title>SPIRIT: Leveraging the SANSA Stack for Transparency and Compliance</title>
      <p>In this paper, we introduce our transparency and compliance checking implementation on top of the SANSA stack, which is depicted in Figure 1.</p>
      <sec id="sec-2-1">
        <title>Figure 1</title>
        <p>[Figure 1: The SPIRIT architecture. The non-Big Data side (left) comprises the SPIRIT dashboard, through which a data subject logs in (with a user id) and inspects policies along the five categories Data, Purpose, Processing, Storage and Sharing, together with the line of business applications whose transaction logs and OWL + rules policies are passed over TLS, through a converter, into a distributed file system. The Big Data side (right) is the SANSA-based application with its Machine Learning, Inference, Querying, Knowledge Distribution &amp; Representation, Distributed In-Memory Processing and Distributed Filesystem layers running on a cluster of nodes, annotated with pseudocode for the processing steps (1)-(5) described below.]</p>
      </sec>
      <sec id="sec-2-20">
        <title>Business Logic</title>
        <p>The SANSA-based transparency and compliance checking application (right) is used to analyse log information concerning personal data processing and sharing that is output from line of business applications on a continuous basis (bottom left), and to present the information to the user via the SPIRIT dashboard (top left).</p>
        <p>Ingesting Transaction Logs into SPIRIT: When it comes to personal data processing there is a need for a general mechanism to verify compliance with existing usage policies and legal obligations. One such mechanism is the recommissioning of existing application and system logs such that they can be used to verify that data processing and sharing complies with the usage policies specified by the data subject. Considering the sheer volume of data generated when application logs are used for personal data processing and sharing auditing, there is a need for a file system that is able to handle Big Data, is fault tolerant, and is capable of supporting parallel processing. The Hadoop Distributed File System (HDFS)10 fulfills all of these criteria and is the default choice for Apache Spark and Apache Flink. Moreover, there is a stable and mature solution for transferring log data to HDFS, called Apache Flume11, which provides a means to transform log content, e.g. obtained from an application log, before it is passed along to HDFS. This allows heterogeneous transaction logs to be translated to RDF on the fly.</p>
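        <p>As an illustration of this on-the-fly translation step, the sketch below maps one JSON transaction-log line to Turtle-style RDF triples, the kind of transformation a Flume interceptor could apply before events reach HDFS. The log layout and the spirit:/ex: terms are assumptions for illustration, not the actual SPIRIT vocabulary.</p>
        <preformat>
```python
# Hypothetical log-to-RDF translation sketch. The JSON field names ("id",
# "userId", "operation", "purpose") and the spirit:/ex: prefixes are
# illustrative assumptions, not the SPIRIT schema.
import json

def log_event_to_turtle(line):
    """Map one JSON transaction-log line to Turtle-style triples."""
    event = json.loads(line)
    subject = "ex:tx-" + event["id"]  # hypothetical transaction IRI
    return "\n".join([
        subject + " rdf:type spirit:Transaction .",
        subject + " spirit:dataSubject ex:user-" + event["userId"] + " .",
        subject + " spirit:processing spirit:" + event["operation"] + " .",
        subject + " spirit:purpose spirit:" + event["purpose"] + " .",
    ])

line = '{"id": "42", "userId": "7", "operation": "Aggregation", "purpose": "Billing"}'
print(log_event_to_turtle(line))
```
        </preformat>
        <p>Running such a mapping per event keeps the log stream uniform regardless of which line of business application produced it, which is what makes the downstream SANSA-side integration straightforward.</p>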
        <p>SPIRIT Transaction Log Processing with SANSA: Our SANSA-based architecture allows storage of, and access to, all log data in a Big Data processing environment. Semantic Web technologies ease data integration across several heterogeneous line of business applications, enabling interoperability across platforms and providing a simple way to link user data and policies. As sketched in Figure 1, the main steps that need to be performed include: (1) loading the policies from the policy store and dividing them into rules, which are used in the reasoning step, and schema/ontology axioms, which are added to the log data later; (2) loading the RDF log data stored in the distributed file system; (3) initialising a query engine with the log and schema/ontology data; (4) creating a reasoner which works on the query engine and considers the rules from the policy store and a set reasoning profile; and eventually (5) invoking the backward chaining on the given query goal. Our SPIRIT architecture offers transparency for data subjects, and means to verify that all business processes comply both with the consent provided by the data subject and with the relevant obligations from the GDPR by: (i) encoding user data policies in (subsets of) OWL 2 DL; and (ii) providing a compliance checking mechanism on the basis of the SANSA inference rule engine. As for the former, we allow policies to define restrictions in terms of five data categories related to the GDPR (as depicted in Figure 1): Data reflects which personal data is governed by the policy. Processing lists the operations (e.g. anonymisation, aggregation, etc.) performed on the personal data. Purpose describes why data are collected/processed. Storage concerns where data are stored and for how long. Sharing specifies the potential use of the personal data by third parties. In addition to the personal data policies, the SPIRIT architecture holds rules that provide a means to check the compliance of data processing and sharing transactions with the data policies and GDPR obligations. Acknowledging that GDPR compliance checking cannot be fully automated (given the generality, vagueness and subjectivity inherent in the regulation), we focus on verifying minimal sets of conditions ("if condition X holds then the data policy Y is violated") to assist the stakeholders in charge of providing evidence of GDPR compliance.</p>
        <p>Footnotes: 10 HDFS, http://hadoop.apache.org/; 11 Apache Flume, http://flume.apache.org/</p>
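        <p>Such minimal-condition violation checks can be illustrated as follows. This sketch encodes two conditions, over the Processing and Sharing categories, as plain Python set membership tests; the policy and transaction structures are illustrative assumptions, whereas SPIRIT encodes policies in OWL 2 DL and evaluates the corresponding rules with the SANSA inference engine.</p>
        <preformat>
```python
# Illustrative compliance-rule sketch: a violation fires when a logged
# transaction uses an operation or recipient the data subject's policy
# does not allow. Field names and values are hypothetical.

def check_compliance(policy, transactions):
    """Return (transaction id, reason) pairs for detected violations."""
    violations = []
    for tx in transactions:
        # Condition: operation not in the permitted Processing set.
        if tx["processing"] not in policy["processing"]:
            violations.append((tx["id"], "processing not permitted"))
        # Condition: recipient not in the permitted Sharing set.
        if tx.get("sharing") and tx["sharing"] not in policy["sharing"]:
            violations.append((tx["id"], "sharing not permitted"))
    return violations

policy = {"processing": {"Anonymisation", "Aggregation"}, "sharing": {"ex:Partner"}}
log = [
    {"id": "t1", "processing": "Aggregation"},
    {"id": "t2", "processing": "Profiling", "sharing": "ex:AdNetwork"},
]
print(check_compliance(policy, log))
# -> [('t2', 'processing not permitted'), ('t2', 'sharing not permitted')]
```
        </preformat>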
        <p>The SPIRIT Dashboard: The SPIRIT dashboard provides a means for data subjects, companies and supervisory authorities to obtain transparency with respect to the processing of personal data and compliance with respect to the data subjects' usage policies. A user request is converted into a query which is passed to the SANSA application, together with a user identifier. The results are then passed back to the dashboard to be presented to the user.</p>
        <p>Acknowledgments. This work is partially funded by the European Union's Horizon 2020 research and innovation programme grants 732194 (QROWD) and 731601 (SPECIAL), and by the Austrian Research Promotion Agency (FFG) grant 861213 (CitySPIN).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><given-names>J.</given-names> <surname>Lehmann</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Sejdiu</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Buhmann</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Westphal</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Stadler</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Ermilov</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Bin</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Chakraborty</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Saleem</surname></string-name>, and
          <string-name><given-names>A.-C. N.</given-names> <surname>Ngomo</surname></string-name>.
          <article-title>Distributed semantic analytics using the SANSA stack</article-title>.
          <source>In Proceedings of the 16th International Semantic Web Conference (ISWC)</source>. Springer,
          <year>2017</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><given-names>M.</given-names> <surname>Nickel</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Murphy</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Tresp</surname></string-name>, and
          <string-name><given-names>E.</given-names> <surname>Gabrilovich</surname></string-name>.
          <article-title>A review of relational machine learning for knowledge graphs</article-title>.
          <source>Proceedings of the IEEE</source>,
          <volume>104</volume>(<issue>1</issue>):<fpage>11</fpage>-<lpage>33</lpage>,
          <year>2016</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>