<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Query-Driven Enforcement of Rule-Based Policies for Data-Privacy Compliance</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Peter K. Schwab</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maximilian S. Langohr</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonas Röckl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Demian E. Vöhringer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas M. Wahl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Klaus Meyer-Wegener</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FAU Erlangen-Nürnberg</institution>
          ,
          <addr-line>Lehrstuhl für Informatik 6, Datenmanagement</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data privacy is currently a topic in vogue for many organizations. Many of them run enterprise data lakes as the data source for an ungoverned ecosystem in which they have no overview of data processing. They aim for mechanisms that require unsophisticated implementation, are easy to use, assume as little technical knowledge as possible, and enable their privacy officers to determine who processes which data, when, and for which purpose. To overcome these challenges, we present a framework for query-driven enforcement of rule-based policies to achieve data-privacy compliance. Our framework can be integrated minimally intrusively into existing IT landscapes. In contrast to existing approaches, privacy officers do not require profound technical knowledge, because our framework enables even non-experts to evaluate the data processing in SQL queries by means of intuitively comprehensible, tree-shaped visualizations. Queries can be classified as legal or illegal regarding data-privacy compliance. We provide a domain-specific language for defining policy rules that can be enforced automatically and in real time.</p>
      </abstract>
      <kwd-group>
        <kwd>Classification Rules</kwd>
        <kwd>Data Privacy</kwd>
        <kwd>Query-Driven</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Data privacy is a trending topic, especially due to the EU General Data
Protection Regulation (GDPR), which took effect in May 2018 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This law sets standards
for companies regarding data processing that are higher than ever before [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However,
many companies are badly prepared and have invested insufficient
financial and human resources to fulfill the GDPR requirements [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Especially ungoverned
ecosystems around enterprise data lakes pose tremendous challenges concerning
data-privacy compliance. While these common data storage and processing
environments democratize access to data sources within an organization, a suitable
infrastructure for monitoring data usage has often not yet been established.
      </p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        Problem Statement: To overcome these challenges, companies need guidance
and support to transform their businesses to comply with the GDPR regulations
in an efficient way. Data processing in today’s world still happens in many cases
via SQL. However, privacy officers frequently do not have even basic SQL
knowledge because database (DB) users are becoming more and more non-experts [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Besides, SQL queries are often embedded in dedicated software applications.
This means they cannot be reviewed separately. In this case, privacy officers need
even more comprehensive knowledge because they also have to review
the source code generating the queries. So companies aim for mechanisms
that require unsophisticated implementation, are easy to use, assume as little
technical knowledge as possible, and enable their privacy officers to determine
who processes which data when and for which purpose.
      </p>
      <p>Contribution: We present a framework for query-driven enforcement of
rule-based policies to achieve data-privacy compliance. Our approach enables
real-time data-privacy compliance in data processing for already existing system
landscapes. In order to trustworthily assess data processing, our approach focuses on
the queries running on the companies’ target DBs. Privacy officers need not be
SQL experts for this assessment, because we visualize data processing for them.
Visualization is based on a tree-shaped structure that is intuitively
comprehensible and illustrates a query’s whole data flow, not only its result.</p>
      <p>
        The implementation of our prototype is work in progress. Our framework is
minimally intrusive, since it does not significantly affect productive operation.
It is inserted between the end-users’ SQL tooling and the target DBs; it intercepts
submitted queries and logs them in our internal query repository. The privacy
officers can classify logged queries as legal or illegal regarding internal
data-privacy regulations by means of rules based on the nodes of the tree-shaped
query representation. Newly submitted queries are checked in real-time against
a pre-filtered, compatible set of all defined rules. As soon as a rule matches, the
respective query is prohibited from running on the target DBs. This approach extends
the framework for query-driven data minimization we presented in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>This paper presents a user story illustrating the basic idea and benefits of
our approach (cf. sec. 2), sketches the framework’s architecture and reference
implementation (cf. sec. 3), and proposes an evaluation concept (cf. sec. 4).
</p>
    </sec>
    <sec id="sec-2">
      <title>Query-driven Rule-based Policy Enforcement</title>
      <p>
        To explain the basic idea behind our approach and to illustrate its benefits, we
take up the user story based on a clinical research scenario described in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
and expand it with the process of query-driven rule-based policy enforcement.
In this scenario, the hospital’s employees access the target DBs of their clinical
information system (CIS) directly via SQL-based tools. Administration
secretaries and doctors generate different reports for their purposes, and there
is Alice, the hospital’s privacy officer. As already described in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], our system is
inserted between the employees’ tooling and the productive CIS target DBs. Our
system intercepts and parses submitted queries, automatically derives related
characteristics, like the query’s runtime and processed schema elements, i. e.
relations and attributes, and logs everything to our internal query repository.
      </p>
      <p>This repository is the base for our approach. This paper describes functionality that notably exceeds the repository’s function range.</p>
      <sec id="sec-2-2">
        <title>Query Auditing and Rule Definition</title>
        <p>[Fig. 1: BPMN diagram of the query-driven rule-based policy-enforcement frontend: Alice starts a new audit session in the Web client, searches for critical queries via the visual search form, obtains a list of potentially illegal queries, evaluates each query, classifies it as legal or illegal, defines a classification rule for illegal queries, and performs query-bunch classification and rule refinement until the audit is complete.]</p>
        <p>
Query Auditing: Now, we enhance this scenario because Alice also wants to
ensure data-privacy compliance for the hospital’s data processing. The BPMN
diagram in fig. 1 illustrates her part in this user story. As she has only little
technical knowledge regarding SQL, our framework supports her with
visualization tools. First, Alice performs an audit based on the queries logged in
our internal query repository. She browses this repository for critical queries
using the visual query search form in our Web client. It is introduced in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and
allows for filtering queries by users and usage of schema elements. As she is well
acquainted with the data stored in the CIS DBs, she quickly finds some critical
queries that are potentially illegal regarding data-privacy compliance. Since Alice is
no SQL expert, our system supports her in comprehending a
query’s data processing by visualizing the data flow through the query based on
an intuitively comprehensible, tree-shaped structure. Every tree node
represents an operation performed in this query and includes a verbal description of
this operation. SQL functionality is reduced to the manageable set of relational
operators, and nested sub-queries are already resolved.
        </p>
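        <p>The tree-shaped structure described above can be sketched in a few lines. This is an illustrative model only; the node class and the verbal descriptions are hypothetical, not the framework’s actual data model:</p>

```python
# Minimal sketch of the tree-shaped visualization (hypothetical data model):
# every node is a relational operator carrying a verbal description, and
# nested sub-queries are assumed to be already resolved into plain nodes.
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    description: str              # verbal description shown to the privacy officer
    children: list = field(default_factory=list)

def render(node, indent=0):
    """Return the plan as an indented, human-readable tree, root first."""
    lines = ["  " * indent + node.description]
    for child in node.children:
        lines.extend(render(child, indent + 1))
    return lines

# Data flow of: SELECT name FROM A JOIN B USING (pseudonym)
plan = PlanNode("Project: keep only column 'name'", [
    PlanNode("Join: match rows of A and B on 'pseudonym'", [
        PlanNode("Scan: read table A"),
        PlanNode("Scan: read table B"),
    ]),
])

for line in render(plan):
    print(line)
```

        <p>Rendering this plan yields a four-line tree that describes the projection, the join, and the two table scans in plain language, i.e. the whole data flow rather than only the query result.</p>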
        <p>Rule Definition: Alice evaluates the critical queries one after another: if she considers a
query to be in violation of any data-privacy regulation, she classifies it as illegal.
In addition, Alice defines a rule that generalizes the query’s parts containing
the illegal data processing. She is supposed to define this rule because only she
has adequate knowledge and skills to properly assess which parts of a query
violate current data-protection regulations. Generalization of query parts means
for example resolving query-specific renaming of schema attributes. To achieve
this, rule definition is based on the visualized tree and its logical operator nodes.</p>
        <p>To illustrate some basic rules, we assume two tables A and B
holding personal data of a larger group of people. Table A holds characteristics
that easily allow identifying a certain person, whereas the characteristics in table
B are equal for many of these people:</p>
        <p>A( pseudonym, name, street, city, date_of_birth )</p>
        <p>B( pseudonym[A], gender, race, professional_group )</p>
        <p>Alice defines the following basic rules. They can be combined and nested
arbitrarily to prevent any kind of illegal data processing. In a separate field of
the rule-definition dialog, an explanation and a reason can also be stated.</p>
        <p>Rule #1: Joining tables A and B via the pseudonym attribute is illegal because
the result contains personal-related information.</p>
        <p>Rule #2: The filter «B.gender = ’F’ AND B.professional_group = ’Civil
engineering’» is illegal because the result set is so small that it could allow drawing
conclusions on personal-related information.</p>
        <p>Rule #3: The filter «COUNT(*) &lt; 5» after any grouping in table B is illegal
because it can reveal possible filter criteria to generate queries with a small result
set that could allow drawing conclusions on personal-related information.</p>
        <p>Rule Refinement: When Alice adds a new classification rule to our system,
the data operations of the queries logged in the internal query repository are
analyzed, and the Web client displays a list of all queries matching the
rule definition. These queries are similar to the originally classified query in the
sense that they contain the same logical operators working on the same schema
elements as defined in the rule. Alice can filter these queries by several criteria,
e. g. by use of certain schema elements, and thereby classify a bunch of queries in
a single click. When needed, she makes use of the tree-shaped visualization to
understand a query’s data processing. If she comes across a query that matches
a rule but that she nevertheless considers legal, she classifies this query
respectively and refines the related rule definition, for example by adding further
matching criteria. This increases the rules’ quality and classification reliability.</p>
        <p>Real-Time Policy Enforcement: As soon as Alice has defined the first rule,
automatic real-time policy enforcement is enabled. The BPMN diagram in fig. 2
shows the interaction of the system’s backend server with the end-users and all
DBs involved in this enforcement:</p>
        <p>When our system intercepts a submitted query, it not only logs it in the
internal query repository, but also checks it against a pre-filtered, compatible
set of all defined rules. If a single rule matches this query, it is prohibited
from running on the target DBs. In that case, instead of forwarding the query’s result to
the end-user’s tool, a respective policy message including further information of
the matching rule is sent. This guarantees an automatic and real-time policy
enforcement for data-privacy compliance.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Architecture and Reference Implementation</title>
        <sec id="sec-2-3-1">
          <title>Real-Time Policy Enforcement in the Backend</title>
          <p>[Fig. 2: BPMN diagram of real-time policy enforcement, with lanes for the hospital employee (via SQL tooling) and the backend server: when a new query is submitted, the server intercepts it, analyzes the query operations, derives the query characteristics, logs the query and its meta-information, and, if classification rules are defined, matches the query against them.]</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>Query Analysis</title>
          <p>
This paper provides query-driven rule-based policy enforcement for data-privacy
compliance. For this purpose, we enhance a framework for query-driven data
minimization presented in [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. Fig. 3 shows the underlying client-server architecture.
The server is implemented in Java; the multi-user client uses contemporary Web
technologies. Client-server communication follows a RESTful JSON API over
HTTP using the Spring Framework (http://spring.io/projects/spring-framework). Components added within this paper’s
scope are highlighted in the figure by a green background; components from
external sources have dashed, thick boundary lines. The new components’
functionality and their interaction with the whole framework are covered below.
The Query operator analysis module uses Apache Calcite (https://calcite.apache.org/) to parse a query
into a logical query plan. Calcite handles standard SQL and several of its
dialects. Dialects not supported by Calcite, e. g. Presto (https://prestosql.io), Athena (https://aws.amazon.com/athena), or Teiid (http://teiid.io), are likewise
not supported by our system. Calcite obtains the required schema information
directly from the target DBs. Our external DB interface supports all relational
DBMSs compliant with the JDBC API.
          </p>
          <p>[Fig. 3: Client-server architecture. The clients connect via the client API to the server controller (Spring framework), which uses the query schema analysis, query operator analysis, and query classification modules; Apache Calcite and the external DB interface connect to the productive target DBs, while the internal DB interface connects to the internal relational DB and, via a Gremlin processor, to the internal graph DB.]</p>
          <p>To ease further analysis, Calcite’s query planner also performs a query
normalization. This includes resolving multi-joins so that each node in the query
plan has at most two successors. This enables a uniform visualization of the
tree-shaped query representation, which facilitates the evaluation of the query’s
data processing for the privacy officers.</p>
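          <p>This multi-join resolution can be illustrated with a small sketch. The tuple-based plan representation is an assumption for illustration only; Calcite’s planner actually works on RelNode rewrite rules:</p>

```python
# Illustrative sketch of the multi-join resolution (assumed tuple-based plan
# representation; Calcite's planner works on RelNode rewrite rules): an
# n-ary join is folded into a left-deep chain of binary joins, so every
# node in the normalized plan has at most two successors.
def binarize_join(inputs):
    """inputs: the list of relations joined together (length at least 2)."""
    node = ("join", inputs[0], inputs[1])  # start with the first two inputs
    for rel in inputs[2:]:                 # fold in the remaining inputs one by one
        node = ("join", node, rel)
    return node

def max_fanout(node):
    """Largest number of successors of any node in the plan tree."""
    if not isinstance(node, tuple):
        return 0                           # a relation name is a leaf
    children = node[1:]
    return max(len(children), *(max_fanout(c) for c in children))

plan = binarize_join(["A", "B", "C", "D"])
print(plan)               # ('join', ('join', ('join', 'A', 'B'), 'C'), 'D')
print(max_fanout(plan))   # 2
```

          <p>After normalization, every join node has exactly two successors, which is what makes the uniform tree visualization possible.</p>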
          <p>By applying the visitor pattern, the resulting plan is traversed to gather
helpful meta-information and to analyze its logical operator nodes. We support
most of Calcite’s provided node types and their variants. As a limitation, we do
not parse, for example, calculations in restrictions and projections, but store them
as simple text strings. In our opinion, the contents of these calculations are not
relevant for a classification regarding data-privacy compliance. The same applies
to CASE statements. Our approach also fails to uncover any non-compliant
data processing inside stored procedures or user-defined functions because
such calls are not resolved. This is out of scope for this paper.</p>
          <p>Finally, the enriched plan is persisted in the internal graph DB via the
internal DB interface. We use Ferma (http://syncleus.com/Ferma/) to map Java objects to
elements in a graph. References between individual nodes of the plan are
represented by edges in the graph DB. Edge labels are not needed because all relevant
information is contained in the nodes. The edges are always directed from the
root to the leaf nodes, matching the order in which nodes are processed when
comparing a query with the policy rules.</p>
          <p>Query Plan Operator Meta-Information: Some of the gathered
meta-information is additionally persisted in the internal relational DB to speed up
access to it. This includes the depth of each plan, the relations used, and the
operator nodes. We use this information to pre-filter compatible query
subsets and thus avoid checking the whole set of queries against the defined rules.</p>
          <p>Additionally, for each operator node in the plan, the following meta-information
is stored: a textual description of the node and an equivalent SQL query
retrieving the result data its respective subtree will produce. Privacy officers can display
this information for a better understanding of the operator’s functionality.
Furthermore, the relations used by a node and their schema attributes are stored
separately. Columns resulting from aggregation or renaming operations are
resolved during the analysis, and the original schema attributes are stored in the
node’s meta-information so that the provenance of such columns is easy to detect.</p>
          <p>Query Classification and Rule Definition: The enriched query plan is the basis for a query’s classification as legal or illegal
regarding data-privacy regulations. The Query classification module provides
this functionality.</p>
          <p>A query’s classification by itself is just another DB entry holding the
respective information. In addition to this mere classification, privacy officers have to
define a related rule that generalizes the related query’s way of data
processing. For rule definition, we propose a domain-specific language (DSL) especially
designed for this purpose. We use the graph-computing framework Apache
TinkerPop and its included graph traversal language Gremlin (http://tinkerpop.apache.org/gremlin.html). TinkerPop supports
a variety of graph DB implementations, and Gremlin allows for
user-defined DSLs. Our DSL supports all operators that can occur in the enriched
logical query plan. Using a DSL increases flexibility, as it can easily be enhanced
with further functionality. The same applies to graph DBs, as new relationships
can be added without having to restructure the DB schema. That is why we
decided to integrate a graph DB alongside the existing relational DB.</p>
          <p>Rule Definition via DSL: Each rule definition starts with the Query
statement, followed by one or several keywords separated by dots. Each keyword
describes a logical operator and can be specified by several optional
parameters. The order of the keywords is relevant and is considered in the graph
traversals. We explain our DSL with the following two example rules based
on rules #1 and #3 introduced in section 2:</p>
          <p>Rule #1: Query.hasJoin( "A.pseudonym", "B.pseudonym" )</p>
          <p>Rule #3: Query.hasGroupBy( ).hasFilter( "B.*", "&lt;", "5", "COUNT" )</p>
          <p>Rule #1 defines a graph traversal searching for queries joining tables A and
B via the pseudonym attribute. The following queries are examples of a match
to this rule:</p>
          <p>&gt; SELECT name, gender, race, professional_group
  FROM A JOIN B USING (pseudonym);</p>
          <p>&gt; SELECT *
  FROM A, B
  WHERE A.pseudonym = B.pseudonym;</p>
          <p>&gt; SELECT name
  FROM A JOIN B ON (A.pseudonym = B.pseudonym)
  WHERE gender = 'M';</p>
          <p>Rule #3 defines a graph traversal searching for queries containing an
aggregation at any point, followed by a filter. The aggregation keyword represents all
GROUP BY clauses in SQL. Equivalents for the filter are all HAVING clauses. WHERE
clauses do not apply in this case because they filter rows before the aggregation.
The related filter condition says that the attribute B.* is used in the aggregation
function COUNT and that the resulting value has to be smaller than five. The following
queries are examples of a match to this rule:</p>
          <p>&gt; SELECT COUNT(pseudonym), gender
  FROM B GROUP BY gender
  HAVING COUNT(pseudonym) &lt; 5;</p>
          <p>&gt; SELECT COUNT(*), race, professional_group
  FROM B GROUP BY race, professional_group
  HAVING COUNT(*) &lt; 5;</p>
          <p>Rule Processing: The rules formulated in our DSL are stored as text in the
relational DB. Gremlin converts this text string into a domain-specific traversal
that can run on a graph DB. In this step, domain-specific keywords and their
parameters are compiled into their associated Java objects and method calls.
Finally, the enriched query-plan graph is queried by the traversals defined
in the rules. So far, we have not implemented any graph-similarity or
comparable algorithms; a query matches a rule exactly if the domain-specific
rule traversal provides a result for this query.</p>
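          <p>The matching semantics can be sketched as follows. This is a strong simplification of the actual Gremlin traversals; the list-based plan and the operator names are assumptions for illustration:</p>

```python
# Illustrative sketch of rule matching (hypothetical data model, not the
# actual Gremlin traversal): a rule is an ordered list of
# (operator, schema_elements) pairs, and a query plan is a root-to-leaf
# list of (operator, schema_elements) nodes.  The rule matches if its
# keywords occur in the plan in order, each on at least the rule's
# schema elements.
def matches(rule, plan):
    pos = 0
    for op, elems in rule:
        while pos != len(plan):
            node_op, node_elems = plan[pos]
            pos += 1
            if node_op == op and set(elems).issubset(node_elems):
                break                      # keyword found; continue with the next one
        else:
            return False                   # plan exhausted before all keywords matched
    return True

# Rule #1: Query.hasJoin("A.pseudonym", "B.pseudonym")
rule_1 = [("join", {"A.pseudonym", "B.pseudonym"})]

# Plan of: SELECT name FROM A JOIN B USING (pseudonym)
plan = [("project", {"A.name"}),
        ("join", {"A.pseudonym", "B.pseudonym"}),
        ("scan", {"A"}),
        ("scan", {"B"})]

print(matches(rule_1, plan))   # True
```

          <p>As in our system, a query matches a rule exactly if the traversal over its plan produces a result; no graph-similarity measure is involved.</p>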
          <p>
Real-Time Policy Enforcement: We integrate the JDBC proxy driver
presented in [
            <xref ref-type="bibr" rid="ref15">15</xref>
] in our framework. This requires minimal configuration effort and is
minimally intrusive as the proxy just wraps the native drivers of the target DB
systems and enables supplementary functionality at runtime by using reflection
techniques. The driver supports tools that use JDBC to connect to target DBs
and is inserted between the end-users’ SQL tooling and the target DBs. Fig. 4
illustrates its interaction with end-users, our framework, and the target DBs.
          </p>
          <p>When an end-user’s SQL tool sends a query to a target DB (step 1), our proxy
intercepts this query and routes it to our framework’s server application (step
2). We use Apache Kafka (https://kafka.apache.org/) as message-oriented middleware for communication.</p>
          <p>[Fig. 4: Interaction of the JDBC proxy driver with the end-users’ SQL-based tooling, the policy-enforcement server, and the productive target DBs (steps 1–6).]</p>
          <p>The server parses the query and generates the enriched query plan. This plan is
checked against all rules available in the system. If no rule matches, the query
is assumed compliant to the data-privacy regulations and is forwarded to the
target DB (step 3), which sends an appropriate response back to the server (step
4). Before this response is forwarded to the proxy (step 5), some of its
meta-information (including runtime, number of tuples, and attributes in the result)
is extracted and stored in the server’s query repository. Finally, the proxy routes
the response back to the SQL tool that has sent the initial query (step 6).</p>
          <p>If one or several rules match the query, it is not sent to the designated
target DB, because it is considered non-compliant to data-privacy regulations.
Therefore, the steps 3 and 4 are skipped, and instead an appropriate message
of type java.sql.SQLException is routed via the proxy (step 5) to the end-user’s
SQL tool (step 6).</p>
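          <p>The decision logic for steps 1–6 can be sketched as follows. The function and rule names are illustrative only; the real driver translates the error into a java.sql.SQLException via JDBC:</p>

```python
# Sketch of the proxy/server decision logic for steps 1-6 (names are
# illustrative).  A matching rule short-circuits steps 3 and 4; the JDBC
# proxy would translate the raised error into a java.sql.SQLException
# for the end-user's SQL tool.
class PolicyViolation(Exception):
    pass

def handle_query(query, rules, run_on_target_db, log_meta_information):
    matching = [r for r in rules if r["matches"](query)]   # check against all rules
    if matching:
        # steps 3 and 4 are skipped; a policy message is returned instead
        raise PolicyViolation("query violates " + matching[0]["name"])
    result = run_on_target_db(query)                       # steps 3 and 4
    log_meta_information(query, result)                    # runtime, tuple count, ...
    return result                                          # steps 5 and 6

rules = [{"name": "Rule #1", "matches": lambda q: "JOIN" in q.upper()}]
print(handle_query("SELECT city FROM A", rules,
                   lambda q: ["result rows"],
                   lambda q, r: None))    # ['result rows']
```
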
          <p>Optimization of Query-Rule Comparison: In order to optimize the
effort of comparing an incoming query with every existing rule for a match, we
narrow down the number of qualifying rules by pre-filtering. We use the
meta-information defined in section 3.1 to exclude rules that cannot apply to a certain
query, e. g. rules that do not use the same relation and operator set as the query,
or rules using more operators than the query. However, neither the operator order
nor which operator uses which relation is considered in this step.</p>
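          <p>This pre-filtering amounts to simple set containment over the stored meta-information; the field names in the following sketch are assumptions for illustration:</p>

```python
# Sketch of rule pre-filtering via the stored meta-information: a rule can
# only match a query whose relation and operator sets contain the rule's
# sets.  Operator order and operator-relation pairing are deliberately
# ignored here; they are checked later by the graph traversal itself.
def candidate_rules(query_meta, rules_meta):
    return [rule for rule in rules_meta
            if rule["relations"].issubset(query_meta["relations"])
            and rule["operators"].issubset(query_meta["operators"])]

query_meta = {"relations": {"A", "B"}, "operators": {"scan", "join", "project"}}
rules_meta = [
    {"name": "Rule #1", "relations": {"A", "B"}, "operators": {"join"}},
    {"name": "Rule #3", "relations": {"B"}, "operators": {"groupby", "filter"}},
]
print([r["name"] for r in candidate_rules(query_meta, rules_meta)])  # ['Rule #1']
```

          <p>Here, Rule #3 is excluded up front because the query contains no grouping or filter operator, so only Rule #1 has to be checked by the more expensive traversal.</p>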
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation Proposal</title>
      <p>As mentioned in the previous section, our implementation is based on a proxy
driver that intercepts the submitted queries and performs additional analysis
steps. Since we build an interactive system that resides between the end-user
and the target DB, it is inevitable that the execution of a query in general takes
longer than a direct execution on the DB. The increase in the runtime of a
query plays a decisive role for the acceptance of the whole system. In a
follow-up evaluation, one could also consider the memory consumption, the network
communication, or other relevant quantities.</p>
      <p>To achieve expedient runtime measurements, we adopt a two-sided approach:
First, we analyze the logical processing steps that are executed for each query.
This includes parsing the query, analyzing its logical query plan, enriching this
plan and persisting it in the internal query repository. We isolate these steps in
a unit test. Each of these unit tests is executed repeatedly via the Java
Microbenchmark Harness (JMH, https://openjdk.java.net/projects/code-tools/jmh/). This tool has explicitly been built for nano-, micro-, and milli-second
benchmarks in Java and takes into account that modern computer systems
behave non-deterministically. The tool even tries to reduce these effects, e. g. via
cache-warm-up runs. This procedure allows obtaining a good estimate of the
cost ratios of individual logical processing steps. A comparison with the actual
runtime of the query also gives an initial insight into the runtime overhead of
our proposed system.</p>
      <p>
        Since we develop an interactive application for query-driven rule-based policy
enforcement, we do not rely on just measuring the runtime of idealized and
isolated unit tests. Instead, we also measure the actual runtime of code parts used
by an interactive call to our system under practical conditions. We plan to
evaluate our framework based on the TPC-DS benchmark (http://www.tpc.org/tpcds/) queries. Hence, we utilize
techniques of aspect-oriented programming [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to insert small-footprint runtime
measurement aspects into our application, while keeping the code clean. In
contrast to the first evaluation part, this one explicitly ignores any indeterministic
effects and therefore may produce less accurate results. However, we assume
that this is a suitable method for demonstrating the eligibility of our interactive
approach even under typical and practical (not idealistic) circumstances.
      </p>
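      <p>In spirit, such a measurement aspect is a thin timing wrapper around the instrumented code. The following decorator is only an analogy in Python; the actual implementation weaves aspects into the Java application:</p>

```python
import time
from functools import wraps

measurements = {}   # function name -> list of measured runtimes in seconds

def measured(func):
    """Small-footprint runtime-measurement 'aspect': records the wall-clock
    time of each call while keeping the measured code itself clean."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            measurements.setdefault(func.__name__, []).append(
                time.perf_counter() - start)
    return wrapper

@measured
def analyze_query(sql):
    return sql.upper()          # stands in for parsing and plan enrichment

analyze_query("select name from a")
print(sorted(measurements))     # ['analyze_query']
```

      <p>As noted above, such in-process measurements ignore non-deterministic effects and are therefore less accurate than the isolated JMH benchmarks, but they reflect the behavior of an interactive call.</p>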
      <p>Because of our two-sided evaluation approach, we expect to be able to
demonstrate that our system’s runtime overhead is acceptable under ideal conditions
(which means the algorithmic complexity of our analysis process is feasible), and
that the system is usable without restrictions even under practical circumstances.</p>
    </sec>
    <sec id="sec-4">
      <title>State of the Art</title>
      <p>In our literature search, we found several similar approaches. They can be
grouped into approaches that identify illegal queries after their execution, i. e.
auditing approaches; those that prevent the execution of illegal queries in advance;
and general-purpose query-log analysis tools.</p>
      <p>
        Both Qapla [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and DataLawyer [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] are systems designed for ad-hoc policy
enforcement, allowing policies to be specified in plain SQL. However, an important
feature of our approach is that privacy officers need not have SQL knowledge,
but can define policy rules for stored data in an intuitive and controllable way.
      </p>
      <p>
        Regarding the auditing of SQL queries, Raghav Kaushik [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and his colleagues
have postulated essential compromises that are required to audit arbitrary SQL
queries with respect to data-privacy compliance. In fact, this means adopting a
weaker privacy model than the one we assume, one that does not comply with
today’s strict data-privacy regulations. In contrast, we decided not to
implement every single SQL feature, but in return are able to prevent the execution
of illegal queries at runtime. Our subset of SQL instructions
nevertheless covers most of the common SQL keywords that are relevant for
enforcing data-privacy compliance (cf. section 3.1).
      </p>
      <p>
        Agrawal et al. describe the advantages of a user-friendly and intuitive
specification of policies in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, their system is exclusively specialized in
auditing, whereas we aim to enforce data-protection guidelines at runtime as
well. The same holds for the approach presented in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>In addition to scientific approaches, we also found a commercial product,
which offers tooling for the classification of data-lake accesses with regard to
their data-protection compliance. BigDataRevealed (https://gdprapplication.blog) is a GDPR application
solution that allows finding GDPR-regulated data in data lakes by searching for
suspicious column names. A column “Social Security Number” may, for
example, be classified as data that is protected by the GDPR. However, we are not aware
of any tool that is able to examine incoming requests and to deny a request if
classified as illegal according to a given policy.</p>
      <p>
        There are also several approaches for interactive SQL query log analysis (e.g.
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). While these projects provide broad general-purpose analysis
capabilities, we focus on enabling non-experts to analyze SQL workloads.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>Our framework allows for automatic query classification with respect to
data-privacy regulations in real time. In an initial audit phase, privacy
officers classify appropriately selected queries from our internal query
repository, which intercepts queries submitted to the target systems. During
this audit phase, the privacy officers also define policy rules for data
processing that conforms to data privacy. Unclassified queries in the query
logs (the “test set”) are then classified semi-automatically in batches with
minimal user interaction. After that, our system is inserted in a minimally
intrusive manner between the end-users’ SQL tooling and the target DBs by means
of a JDBC proxy DB driver. From then on, our system automatically enforces the
defined privacy-policy rules on the intercepted queries. This guarantees that
data processing not conforming to data privacy is strictly prohibited.</p>
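      <p>To illustrate the interception principle behind the proxy driver, a simplified, hypothetical sketch follows. The class and rule names are our own illustrative assumptions, not the framework’s actual API; each intercepted statement is checked against the policy rules before it is forwarded to the target database:</p>

```java
// Hypothetical sketch (not the authors' implementation): a proxy layer that
// classifies each SQL statement against policy rules before forwarding it.
import java.sql.SQLException;
import java.util.List;
import java.util.function.Predicate;

public class PolicyEnforcingProxy {

    /** A policy rule rejects queries whose text matches a forbidden pattern. */
    public record Rule(String name, Predicate<String> violates) {}

    private final List<Rule> rules;

    public PolicyEnforcingProxy(List<Rule> rules) {
        this.rules = rules;
    }

    /** Classify a query; throw instead of forwarding if any rule is violated. */
    public String execute(String sql) throws SQLException {
        String normalized = sql.toLowerCase();
        for (Rule r : rules) {
            if (r.violates().test(normalized)) {
                throw new SQLException("rejected by policy rule: " + r.name());
            }
        }
        // In the real driver this would delegate to the wrapped JDBC connection.
        return "forwarded: " + sql;
    }

    public static void main(String[] args) throws SQLException {
        PolicyEnforcingProxy proxy = new PolicyEnforcingProxy(List.of(
                new Rule("no-ssn-access",
                        q -> q.contains("social_security_number"))));

        // Legal query passes through to the target database.
        System.out.println(proxy.execute("SELECT name FROM employees"));
        // Illegal query is denied before it reaches the database.
        try {
            proxy.execute("SELECT social_security_number FROM employees");
        } catch (SQLException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

      <p>In the actual framework, such checks are driven by the policy rules defined in the domain-specific language rather than by hard-coded predicates.</p>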
      <p>The implementation of the query-parsing functionality and the rules engine
for classification is practically complete. Our implementation also handles
sophisticated queries and can enforce the defined policy rules. We are
currently implementing the functionality needed to perform the proposed
evaluation measurements for our approach, and we are also integrating the
proxy driver for real-time policy enforcement.</p>
      <p>Some future work remains. We need to extend our framework with the
tree-shaped visualization mentioned above, which will support the privacy
officers in evaluating the queries’ data processing. Furthermore, we aim to
provide a user interface for rule definition that supports graphical
construction of rules by dragging and dropping the desired logical-operator
nodes from the tree-shaped visualization.</p>
      <p>Acknowledgement: The authors would like to thank the anonymous reviewers for their
valuable remarks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.:
          <article-title>Auditing compliance with a hippocratic database</article-title>
          .
          <source>In: (e)Proc. 30th Int. Conf. on VLDB</source>
          , Toronto, Canada,
          <source>Aug 31 - Sep 3</source>
          . pp.
          <fpage>516</fpage>
          -
          <lpage>527</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Van den Bussche</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Towards practical meta-querying</article-title>
          .
          <source>Inf. Syst</source>
          .
          <volume>30</volume>
          (
          <issue>4</issue>
          ),
          <fpage>317</fpage>
          -
          <lpage>332</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Council of EU:
          <article-title>Council regulation (EU) no 2016/679 of the European Parliament and of the Council of 27 Apr 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)</article-title>
          .
          <source>OJ L 119</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>88</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Goddard</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The EU General Data Protection Regulation (GDPR): European regulation that has a global impact</article-title>
          .
          <source>Int. Journal of Market Research</source>
          <volume>59</volume>
          (
          <issue>6</issue>
          ),
          <fpage>703</fpage>
          -
          <lpage>705</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , et al.:
          <article-title>A unified audit expression model for auditing SQL queries</article-title>
          .
          <source>In: Data and Applications Security XXII, 22nd Annual IFIP WG 11.3 Working Conf. on Data and Applications Security</source>
          , London, UK,
          <source>Jul 13-16</source>
          . pp.
          <fpage>33</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kaushik</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.:
          <article-title>Efficient auditing for complex SQL queries</article-title>
          .
          <source>In: Proc. ACM SIGMOD</source>
          , Athens, Greece,
          <source>Jun 12-16</source>
          . pp.
          <fpage>697</fpage>
          -
          <lpage>708</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kiczales</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , et al.:
          <article-title>Aspect-oriented programming</article-title>
          .
          <source>In: ECOOP'97 - Object-Oriented Programming, 11th European Conf., Jyväskylä, Finland, Jun 9-13</source>
          . pp.
          <fpage>220</fpage>
          -
          <lpage>242</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , et al.:
          <article-title>Constructing an interactive natural language interface for relational databases</article-title>
          .
          <source>PVLDB</source>
          <volume>8</volume>
          (
          <issue>1</issue>
          ),
          <fpage>73</fpage>
          -
          <lpage>84</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mehta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Qapla: Policy compliance for database-backed systems</article-title>
          .
          <source>In: 26th USENIX Security Symposium</source>
          , Vancouver, BC, Canada,
          <source>Aug 16-18</source>
          . pp.
          <fpage>1463</fpage>
          -
          <lpage>1479</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>P.K.</given-names>
          </string-name>
          , et al.:
          <article-title>Towards query-driven data minimization</article-title>
          .
          <source>In: Proc. Conf. LWDA</source>
          , Mannheim, Germany,
          <source>Aug 22-24</source>
          . pp.
          <fpage>335</fpage>
          -
          <lpage>338</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>P.K.</given-names>
          </string-name>
          , et al.:
          <article-title>Query-driven data minimization with the dataeconomist</article-title>
          .
          <source>In: Advances in Database Technology - 22nd Int. Conf. on Extending Database Technology</source>
          , EDBT, Lisbon, Portugal,
          <source>Mar 26-29</source>
          . pp.
          <fpage>614</fpage>
          -
          <lpage>617</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Tikkinen-Piri</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , et al.:
          <article-title>EU general data protection regulation: Changes and implications for personal data collecting companies</article-title>
          .
          <source>Computer Law &amp; Security Review</source>
          <volume>34</volume>
          (
          <issue>1</issue>
          ),
          <fpage>134</fpage>
          -
          <lpage>153</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Upadhyaya</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>Automatic enforcement of data use policies with datalawyer</article-title>
          .
          <source>In: Proc. ACM SIGMOD</source>
          , Melbourne, Victoria, Australia, May 31 - Jun 4. pp.
          <fpage>213</fpage>
          -
          <lpage>225</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Wahl</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          , et al.:
          <article-title>Crossing an OCEAN of queries: analyzing SQL query logs with oceanlog</article-title>
          .
          <source>In: Proc. of the 30th Int. Conf. on Scientific and Statistical Database Management</source>
          , SSDBM, Bozen-Bolzano, Italy,
          <source>Jul 09-11</source>
          . pp.
          <fpage>30:1</fpage>
          -
          <lpage>30:4</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Wahl</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          , et al.:
          <article-title>Minimally-intrusive augmentation of data science workflows</article-title>
          .
          <source>In: Proc. Conf. LWDA</source>
          , Mannheim, Germany,
          <source>Aug 22-24</source>
          . pp.
          <fpage>339</fpage>
          -
          <lpage>342</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>