Towards Query-Driven Data Minimization

     Peter K. Schwab, Julian O. Matschinske, Andreas M. Wahl, and Klaus
                                Meyer-Wegener

             FAU Erlangen-Nürnberg, Informatik 6 (Datenmanagement)

       Abstract. Due to the data privacy laws inured recently, data minimiza-
       tion is a highly topical issue. So far, many companies have collected and
       processed a vast amount of personal data while treating data privacy
       negligently. They now need to narrow down collection and processing,
       but do not know, which data is essential for transacting their businesses.
       To overcome these problems, we propose query-driven data minimization.
       Our approach grants user-specific access to database schemas based on
       SQL reports and supports deletion of unneeded schema elements.
       In this paper, we outline the schema of a query repository for managing
       query meta-information. We introduce a user story emphasizing the ben-
       efits of our approach for query-driven data minimization and we sketch
       the basic architecture of a framework implementing our approach.

1     Introduction
The EU General Data Protection Regulation1 (GDPR) recently took effect and
requires substantial changes for almost every institution storing personal data.
For example, Art 5 GDPR postulates the principle of data minimization, i. e.
to collect, process, and use as little personal data as possible. However, it is a
tremendous effort to evaluate which data must really be collected and which
users are allowed to use them.
    On one side, there are data scientists with access to large parts of the com-
panies’ datasets. They are often unaware of data privacy and will not restrict
data access by themselves. On the other side, there are privacy officers who are
well aware of data privacy. They usually cannot prohibit fine-grained, technical
data access, e. g. based on SQL queries, but only step in when data collection or
processing has already taken place.
    The data-privacy community advocates privacy by design [7], which incor-
porates privacy protection into the overall design of technical systems. In this
paper, we take up privacy by design and propose a minimally-intrusive approach
regarding query-driven data minimization in existing relational IT landscapes.
We enable privacy officers to grant customized access to relational databases
based on the queries that users run on these systems. Furthermore, our ap-
proach shows which data are actually not used in any data processing and lists
the queries that have inserted the data into the database. Privacy officers can
trace and delete unneeded data as well as prevent their future collection. To
reach this, we have extended an initial design for a query repository proposed in
[8] in order to log queries and enrich them with meta-information.
1
    http://eur-lex.europa.eu/eli/reg/2016/679/2016-05-04
2                 Peter K. Schwab et al.

2        Query Repository
The query repository records SQL-based queries and their characteristics. Fig. 1
shows an Entity-Relationship diagram with its conceptual schema.

                                           Query Context                      1                happens in     Query
                                                                                                            Execution
   Query
Environment
                     Query                                                                                   Timestamp

     DBName          Context
                             ContextName      Description          Priority       Legitimacy      N
                                                                                                              RunTime

       Host
                        Database                                                             Query
                                                                                                            ErrorMessage
                                              1                                            Execution
       Port
                                                            runs
                                                             at                                                Text
                                                                                                                              RelName
    Username

                           User              1                                1                Query         Relations
                                                                                                                              Attributes
    Credentials
                                                                                                            Query Structure

                         Fig. 1. The conceptual schema of the query repository

    While most entities, relationships, and attributes are self-explanatory, Query
Context needs some remarks. It has an identifying ContextName and an optional
Description. These attributes point out a user’s intention or the situation in which
she runs a query via the relationship happens in. If a Query Execution happens
in a certain Query Context, the user can additionally specify the query’s priority
and its legitimacy in this context. A query execution can only be done in one
context, but a context may contain many query executions.
    We organize query characteristics in four groups highlighted in color in Fig. 1.
The first one is the Query Structure that we derive automatically from the query
itself. The other groups hold meta-information on the queries. The Query Envi-
ronment is derived from the user and her initial connection to the target database.
The Query Execution stems from the query, the executing user, and the database.
Finally, the Query Context cannot be derived automatically; it must be entered
manually. It typically contains tacit user knowledge, which we externalize here.

3        Benefits of a Query Repository
To explain the benefits of a query repository, we describe a user story from a
clinical research scenario: The databases of a clinical information system (CIS)
store data required for the clinical operation. The hospital’s employees access
these databases directly in sessions using SQL-based tooling and generate re-
ports for their purposes. For example, Bob is responsible for billing of services
provided by the hospital. He reads the patients’ address data as well as ICD
(International Classification of Diseases) codes stored in case files. Doctor Carol
generates reports for a clinical study based on case files (“secondary use” of
data). She reads medication data as well as data regarding the patients’ health
status. Our query repository (QRep) logs all queries submitted to the CIS and
automatically derives related query characteristics.
                                   Towards Query-Driven Data Minimization         3

    In the past, Bob’s and Carol’s access to the databases was unrestricted, i. e.
Bob had access to medication data, although not relevant for billing purposes,
and Carol was able to see address data not needed for her clinical study. We are in
June 2018 and Alice, the hospital’s privacy officer, wants to apply Art 5 GDPR in
terms of data minimization. Therefore, she displays Bob’s report results together
with its related queries in the QRep WebApp. She marks all queries as ‘legal’ if the
reports are compliant with Art 5 GDPR. Otherwise, she identifies unauthorized
queries and marks them as ‘illegal’. The QRep server stores her markers as Query
Context information. Finally, Alice repeats the procedure for Carol’s reports.
    Next, Alice clicks a button and the QRep server determines all schema at-
tributes and relations used in the queries marked as ‘legal’, grouped by database
user. Our tool then retrieves the entire schema information from the databases,
requests Alice to decide each user’s access rights to unused schema elements and
finally enforces these rights to the databases. In the example, Carol must no
longer access the entire relation with the patient’s address data, and Bob has
access only to the ICD attribute in relation to the case-related diagnostic data
of the patients. This ensures data minimization regarding data processing.
    Alice clicks another button and our tool determines the schema elements to
which no user is granted read access anymore. It also lists the queries that have
inserted the data. Alice can see now that there is, e. g. an attribute holding a
patient’s nationality that is not needed for any data processing and can delete
this attribute by a single click. She also sees that employee Dave ran the related
queries, contacts him and ensures data minimization regarding data collection.

4    Reference Architecture
To maintain compliance with data-protection regulation, responsible officers can
use our WebApp. It provides an easy-to-use interface and does not require any
knowledge of SQL. The WebApp interacts with the QRep server via a JSON-based
REST API. The overall system architecture is shown in Fig. 2. The QueryMan-
ager passes incoming SQL queries to the Analyzer module, which enriches them
with Query Structure characteristics by using Apache Calcite as parser. The DB
Analyzer connects with target databases to determine Query Execution charac-
teristics. Users can add Query Context characteristics via WebApp. The QRep
DB persists this information. The SearchModule allows for browsing queries and
their characteristics and displaying report results. The DB Analyzer also provides
full target schema information for detecting unneeded relations and attributes.

    Data Sources                 Log Reader            QRep WebApp
                                 DB Analyzer
      Apache Calcite           Query Analyzer         Search Module

       QRep DB                     Query Manager                  QRep Server
                   Fig. 2. The overall architecture of our framework
4       Peter K. Schwab et al.

5     Related Work
Khoussainova et al. propose dedicated systems for query management [3]. How-
ever, they focus on query completion [4]. Srivastava et al. provide a relational
framework for managing queries [9]. We follow their approach and manage
queries and their meta-information together in a relational model. A tool for
session-based query-log browsing is suggested in [2]. However, queries are not en-
riched with meta-information and browsing is limited to keyword-based search.
Van den Bussche et al. provide user-defined functions to query ASTs stored
as XML files using a relational database [1]. Query context and other meta-
information are not considered. This is also the case for several graph-based
approaches for representing SQL queries, e. g. [5, 6]. In [10], we already analyzed
query logs to facilitate data integration. However, we did not yet provide a way
to report user-specific meta-information about queries. Manta and SQLDep2
visualize schema lineage of SQL queries without considering query context.

6     Summary and Future Work
Our framework supports data minimization regarding the collection and pro-
cessing of data. The query-driven approach allows to customize user access to
the data and to identify unneeded schema elements. We plan to evaluate our
approach based on publicly accessible query logs. Regarding data minimization,
we also aim to consider unneeded tuples.

Acknowledgement: The authors would like to thank the anonymous reviewers for their
valuable remarks.

References
 1. Van den Bussche, J., Vansummeren, S., Vossen, G.: Towards practical meta-
    querying. Information Systems 30(4), 317–332 (2005)
 2. Khoussainova, N., et al.: Session-based browsing for more effective query reuse. In:
    Proc. SSDBM. pp. 583–585. Springer (2011)
 3. Khoussainova, N., Balazinska, M., Gatterbauer, W., Kwon, Y., Suciu, D.: A case
    for a collaborative query management system. In: Proc. CIDR (2009)
 4. Khoussainova, N., Kwon, Y., Balazinska, M., Suciu, D.: Snipsuggest: Context-
    aware autocompletion for SQL. Proc. VLDB 4(1), 22–33 (2010)
 5. Koutrika, G., Simitsis, A., Ioannidis, Y.E.: Explaining structured queries in natural
    language. In: Proc. ICDE. pp. 333–344. IEEE (2010)
 6. Papastefanatos, G., et al.: Hecataeus: A framework for representing SQL constructs
    as graphs. In: Proc. EMMSAD. vol. 5 (2005)
 7. Schaar, P.: Privacy by design. Identity in the Information Society 3(2), 267–274
    (Aug 2010). https://doi.org/10.1007/s12394-010-0055-x
 8. Schwab, P., et al.: Query-driven data integration (short paper). In: Proc. LWDA.
    CEUR Workshop Proc., vol. 1670, pp. 206–211. CEUR-WS (2016)
 9. Srivastava, D., Velegrakis, Y.: Intensional associations between data and metadata.
    In: Proc. SIGMOD. pp. 401–412. ACM (2007)
10. Wahl, A.M., et al.: Query-driven knowledge-sharing for data integration and col-
    laborative data science. In: Proc. ADBIS. pp. 63–72 (2017)

2
    https://getmanta.com/ and https://sqldep.com/