<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Query-Driven Data Minimization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Peter K. Schwab</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julian O. Matschinske</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas M. Wahl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Klaus Meyer-Wegener</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FAU Erlangen-Nu ̈rnberg</institution>
          ,
          <addr-line>Informatik 6, Datenmanagement</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Due to the data privacy laws inured recently, data minimization is a highly topical issue. So far, many companies have collected and processed a vast amount of personal data while treating data privacy negligently. They now need to narrow down collection and processing, but do not know, which data is essential for transacting their businesses. To overcome these problems, we propose query-driven data minimization. Our approach grants user-specific access to database schemas based on SQL reports and supports deletion of unneeded schema elements. In this paper, we outline the schema of a query repository for managing query meta-information. We introduce a user story emphasizing the benefits of our approach for query-driven data minimization and we sketch the basic architecture of a framework implementing our approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The EU General Data Protection Regulation1 (GDPR) recently took effect and
requires substantial changes for almost every institution storing personal data.
For example, Art 5 GDPR postulates the principle of data minimization, i. e.
to collect, process, and use as little personal data as possible. However, it is a
tremendous effort to evaluate which data must really be collected and which
users are allowed to use them.</p>
      <p>On one side, there are data scientists with access to large parts of the
companies’ datasets. They are often unaware of data privacy and will not restrict
data access by themselves. On the other side, there are privacy officers who are
well aware of data privacy. They usually cannot prohibit fine-grained, technical
data access, e. g. based on SQL queries, but only step in when data collection or
processing has already taken place.</p>
      <p>
        The data-privacy community advocates privacy by design [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which
incorporates privacy protection into the overall design of technical systems. In this
paper, we take up privacy by design and propose a minimally-intrusive approach
regarding query-driven data minimization in existing relational IT landscapes.
We enable privacy officers to grant customized access to relational databases
based on the queries that users run on these systems. Furthermore, our
approach shows which data are actually not used in any data processing and lists
the queries that have inserted the data into the database. Privacy officers can
trace and delete unneeded data as well as prevent their future collection. To
reach this, we have extended an initial design for a query repository proposed in
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] in order to log queries and enrich them with meta-information.
The query repository records SQL-based queries and their characteristics. Fig. 1
shows an Entity-Relationship diagram with its conceptual schema.
      </p>
      <p>Description</p>
      <p>Priority</p>
      <p>Legitimacy</p>
      <p>Query
Environment</p>
      <p>DBName
Host</p>
      <p>Port
Username
Credentials</p>
      <p>Query
Context ContextName</p>
      <sec id="sec-1-1">
        <title>Database</title>
      </sec>
      <sec id="sec-1-2">
        <title>User</title>
        <p>Query Context
1
1
runs
at
1
1
happens in</p>
        <p>N</p>
      </sec>
      <sec id="sec-1-3">
        <title>Query</title>
      </sec>
      <sec id="sec-1-4">
        <title>Execution</title>
      </sec>
      <sec id="sec-1-5">
        <title>Query</title>
        <p>Query
Execution
Timestamp</p>
        <p>RunTime
ErrorMessage</p>
        <p>Text
Relations
Query Structure</p>
        <p>RelName
Attributes</p>
        <p>While most entities, relationships, and attributes are self-explanatory, Query
Context needs some remarks. It has an identifying ContextName and an optional
Description. These attributes point out a user’s intention or the situation in which
she runs a query via the relationship happens in. If a Query Execution happens
in a certain Query Context, the user can additionally specify the query’s priority
and its legitimacy in this context. A query execution can only be done in one
context, but a context may contain many query executions.</p>
        <p>We organize query characteristics in four groups highlighted in color in Fig. 1.
The first one is the Query Structure that we derive automatically from the query
itself. The other groups hold meta-information on the queries. The Query
Environment is derived from the user and her initial connection to the target database.
The Query Execution stems from the query, the executing user, and the database.
Finally, the Query Context cannot be derived automatically; it must be entered
manually. It typically contains tacit user knowledge, which we externalize here.
3</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Benefits of a Query Repository</title>
      <p>To explain the benefits of a query repository, we describe a user story from a
clinical research scenario: The databases of a clinical information system (CIS)
store data required for the clinical operation. The hospital’s employees access
these databases directly in sessions using SQL-based tooling and generate
reports for their purposes. For example, Bob is responsible for billing of services
provided by the hospital. He reads the patients’ address data as well as ICD
(International Classification of Diseases) codes stored in case files. Doctor Carol
generates reports for a clinical study based on case files (“secondary use” of
data). She reads medication data as well as data regarding the patients’ health
status. Our query repository (QRep) logs all queries submitted to the CIS and
automatically derives related query characteristics.</p>
      <p>In the past, Bob’s and Carol’s access to the databases was unrestricted, i. e.
Bob had access to medication data, although not relevant for billing purposes,
and Carol was able to see address data not needed for her clinical study. We are in
June 2018 and Alice, the hospital’s privacy officer, wants to apply Art 5 GDPR in
terms of data minimization. Therefore, she displays Bob’s report results together
with its related queries in the QRep WebApp. She marks all queries as ‘legal’ if the
reports are compliant with Art 5 GDPR. Otherwise, she identifies unauthorized
queries and marks them as ‘illegal’. The QRep server stores her markers as Query
Context information. Finally, Alice repeats the procedure for Carol’s reports.</p>
      <p>Next, Alice clicks a button and the QRep server determines all schema
attributes and relations used in the queries marked as ‘legal’, grouped by database
user. Our tool then retrieves the entire schema information from the databases,
requests Alice to decide each user’s access rights to unused schema elements and
finally enforces these rights to the databases. In the example, Carol must no
longer access the entire relation with the patient’s address data, and Bob has
access only to the ICD attribute in relation to the case-related diagnostic data
of the patients. This ensures data minimization regarding data processing.</p>
      <p>Alice clicks another button and our tool determines the schema elements to
which no user is granted read access anymore. It also lists the queries that have
inserted the data. Alice can see now that there is, e. g. an attribute holding a
patient’s nationality that is not needed for any data processing and can delete
this attribute by a single click. She also sees that employee Dave ran the related
queries, contacts him and ensures data minimization regarding data collection.
4
To maintain compliance with data-protection regulation, responsible officers can
use our WebApp. It provides an easy-to-use interface and does not require any
knowledge of SQL. The WebApp interacts with the QRep server via a JSON-based
REST API. The overall system architecture is shown in Fig. 2. The
QueryManager passes incoming SQL queries to the Analyzer module, which enriches them
with Query Structure characteristics by using Apache Calcite as parser. The DB
Analyzer connects with target databases to determine Query Execution
characteristics. Users can add Query Context characteristics via WebApp. The QRep
DB persists this information. The SearchModule allows for browsing queries and
their characteristics and displaying report results. The DB Analyzer also provides
full target schema information for detecting unneeded relations and attributes.</p>
      <sec id="sec-2-1">
        <title>Data Sources</title>
      </sec>
      <sec id="sec-2-2">
        <title>QRep DB</title>
      </sec>
      <sec id="sec-2-3">
        <title>Apache Calcite</title>
      </sec>
      <sec id="sec-2-4">
        <title>Query Analyzer</title>
      </sec>
      <sec id="sec-2-5">
        <title>Search Module</title>
      </sec>
      <sec id="sec-2-6">
        <title>Log Reader</title>
      </sec>
      <sec id="sec-2-7">
        <title>DB Analyzer</title>
      </sec>
      <sec id="sec-2-8">
        <title>QRep WebApp</title>
      </sec>
      <sec id="sec-2-9">
        <title>Query Manager</title>
        <p>QRep Server</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5 Related Work</title>
      <p>
        Khoussainova et al. propose dedicated systems for query management [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
However, they focus on query completion [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Srivastava et al. provide a relational
framework for managing queries [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We follow their approach and manage
queries and their meta-information together in a relational model. A tool for
session-based query-log browsing is suggested in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, queries are not
enriched with meta-information and browsing is limited to keyword-based search.
Van den Bussche et al. provide user-defined functions to query ASTs stored
as XML files using a relational database [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Query context and other
metainformation are not considered. This is also the case for several graph-based
approaches for representing SQL queries, e. g. [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], we already analyzed
query logs to facilitate data integration. However, we did not yet provide a way
to report user-specific meta-information about queries. Manta and SQLDep2
visualize schema lineage of SQL queries without considering query context.
      </p>
    </sec>
    <sec id="sec-4">
      <title>6 Summary and Future Work</title>
      <p>Our framework supports data minimization regarding the collection and
processing of data. The query-driven approach allows to customize user access to
the data and to identify unneeded schema elements. We plan to evaluate our
approach based on publicly accessible query logs. Regarding data minimization,
we also aim to consider unneeded tuples.</p>
      <p>Acknowledgement: The authors would like to thank the anonymous reviewers for their
valuable remarks.
2 https://getmanta.com/ and https://sqldep.com/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Van den Bussche, J.,
          <string-name>
            <surname>Vansummeren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vossen</surname>
          </string-name>
          , G.:
          <article-title>Towards practical metaquerying</article-title>
          .
          <source>Information Systems</source>
          <volume>30</volume>
          (
          <issue>4</issue>
          ),
          <fpage>317</fpage>
          -
          <lpage>332</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Khoussainova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , et al.:
          <article-title>Session-based browsing for more effective query reuse</article-title>
          .
          <source>In: Proc. SSDBM</source>
          . pp.
          <fpage>583</fpage>
          -
          <lpage>585</lpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Khoussainova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balazinska</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gatterbauer</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kwon</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suciu</surname>
            ,
            <given-names>D.:</given-names>
          </string-name>
          <article-title>A case for a collaborative query management system</article-title>
          .
          <source>In: Proc. CIDR</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Khoussainova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kwon</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balazinska</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suciu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          : Snipsuggest:
          <article-title>Contextaware autocompletion for SQL</article-title>
          .
          <source>Proc. VLDB</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <fpage>22</fpage>
          -
          <lpage>33</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Koutrika</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simitsis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ioannidis</surname>
            ,
            <given-names>Y.E.</given-names>
          </string-name>
          :
          <article-title>Explaining structured queries in natural language</article-title>
          .
          <source>In: Proc. ICDE</source>
          . pp.
          <fpage>333</fpage>
          -
          <lpage>344</lpage>
          . IEEE (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Papastefanatos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , et al.:
          <article-title>Hecataeus: A framework for representing SQL constructs as graphs</article-title>
          .
          <source>In: Proc. EMMSAD</source>
          . vol.
          <volume>5</volume>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Schaar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Privacy by design</article-title>
          .
          <source>Identity in the Information Society</source>
          <volume>3</volume>
          (
          <issue>2</issue>
          ),
          <fpage>267</fpage>
          -
          <lpage>274</lpage>
          (
          <year>Aug 2010</year>
          ). https://doi.org/10.1007/s12394-010-0055-x
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>Query-driven data integration (short paper)</article-title>
          .
          <source>In: Proc. LWDA. CEUR Workshop Proc.</source>
          , vol.
          <volume>1670</volume>
          , pp.
          <fpage>206</fpage>
          -
          <lpage>211</lpage>
          . CEUR-WS (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velegrakis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Intensional associations between data and metadata</article-title>
          .
          <source>In: Proc. SIGMOD</source>
          . pp.
          <fpage>401</fpage>
          -
          <lpage>412</lpage>
          . ACM (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wahl</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          , et al.:
          <article-title>Query-driven knowledge-sharing for data integration and collaborative data science</article-title>
          .
          <source>In: Proc. ADBIS</source>
          . pp.
          <fpage>63</fpage>
          -
          <lpage>72</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>