<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Predicting the Amount of GDPR Fines</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Future Technologies, University of Turku</institution>
          ,
          <addr-line>Turku</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The General Data Protection Regulation (GDPR) became enforceable in 2018. Since then, many fines have already been imposed by national data protection authorities in the European Union (EU). This paper examines the individual GDPR articles referenced in the enforcement decisions and predicts the amount of enforcement fines with available meta-data and text mining features extracted from the enforcement decision documents. According to the results, articles related to the general principles, lawfulness, and information security have been the most frequently referenced ones. Although the amount of fines imposed varies across the articles referenced, these three particular articles do not stand out. Furthermore, good predictions are attainable even with simple machine learning techniques for regression analysis. Basic meta-data (such as the articles referenced and the country of origin) yields slightly better performance compared to the text mining features.</p>
      </abstract>
      <kwd-group>
        <kwd>Text mining</kwd>
        <kwd>Legal mining</kwd>
        <kwd>Data protection</kwd>
        <kwd>Law enforcement</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Data protection has a long history in the EU. In particular, the GDPR
repealed the earlier Directive 95/46/EC. Although this directive laid down much
of the legal groundwork for EU-wide data protection and privacy, its national
adaptations, legal interpretations, and enforcement varied both across the
member states and different EU institutions [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In short: it was a paper tiger. In
contrast, Regulation (EU) 2016/679, the GDPR, is a regulation; it is binding
throughout the EU with only minimal space for national adaptations. In
practice, only a few Articles (A) in the GDPR provide some limited room for
national maneuvering; these include A6 with respect to relaxation in terms of
other legal obligations or public interests, A9 in terms of sensitive data, and A10
regarding criminal matters. Thus, in general, this particular legislation should
be interpreted and enforced uniformly throughout the European Union by national
data protection authorities whose formal powers are defined in A58. In
practice, however, the resources and thus the actual power for enforcement
already vary across the member states [
        <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
        ]. Coupled with a lack of previous research
on the enforcement of the GDPR, this variance provides a motivation for the
present work to examine the recent enforcement fines imposed according to the
conditions specified in A83. In addition, the work is motivated by a tangential
question: is it also possible to predict these fines by machine learning methods?
      </p>
      <p>To answer the question, the paper uses meta-data and text mining features
extracted from the decision documents released by the national authorities. As
such, only black-box predictions are sought; the goal is not to make any legal
interpretations whatsoever. Nevertheless, the answer provided still establishes a
solid contribution, especially when considering that the paper is presumably the
very first to even examine the GDPR fines. As is discussed in Section 2, the
black-box approach also places the paper into a specific branch of existing research
dealing with legal documents. That section also refines the question into two
more specific research questions. Afterwards, the structure is straightforward:
the dataset and methods are elaborated in Sections 3 and 4, results are presented
in Section 5, and conclusions follow in Section 6. As will be noted in the final
section, there are also some lessons that should not be learned from this work.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        Legal mining, for lack of a better term, has emerged in recent years as a
promising but at times highly contested interdisciplinary field that uses machine
learning techniques to analyze various aspects related to law [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Although the
concrete application domains vary, case law and court cases are the prime
examples because these constitute the traditional kernel of legal scholarship.
Within this kernel, existing machine learning applications range from the
classification of judges' ideological positions [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which may be illegal in some
European countries [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], to the prediction of decisions of the European Court of
Human Rights [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. These examples convey the traditional functions of
applied machine learning: exploratory data mining and the prediction of the future.
      </p>
      <p>
        There is also another closely related application domain. Again for lack of a
better term, data extraction could be a label for this domain: by exploiting the
nature of law as an art of persuasion [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the domain uses distinct information
retrieval techniques to extract and quantify textual data from legal documents
into structured collections with a predefined logic and semantics [
        <xref ref-type="bibr" rid="ref2 ref24 ref28">2, 24, 28</xref>
        ]. To
gain a hint about the extraction, one might consider a legal document to contain
some facts, rights, obligations, and prohibitions, statements and modalities about
these, and so forth. Although the two application domains are complementary
in many respects, the underlying rationales exhibit some notable differences.
      </p>
      <p>
        Oftentimes, the legal mining domain is motivated by a traditional rationale
for empirical social science research: to better understand trends and patterns
in lawmaking and law enforcement; to contrast these with legal philosophies and
theories; and so forth. This rationale extends to public administration: machine
learning may ease the systematic archiving of legal documents and the finding of
relevant documents, and, therefore, it may also reduce administrative costs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
These administrative aspects reflect the goal of building "systems that assist in
decision-making", whereas the predictive legal mining applications seek to build
"systems that make decision" [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Although the data extraction domain can
be motivated by the same administrative rationale, providing data to predictive
systems is seldom the intention behind the extraction. Instead, there is a further
rationale in this domain: to extract requirements for software and systems in
order to comply with the laws from which a given extraction is done [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Driven
by the genuine interest to facilitate collaboration between lawyers and engineers
in order to build law-compliant software and systems [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], this rationale has
been particularly prevalent in the contexts of data protection and privacy. For
instance, previous work has been done to extract requirements from the Health
Insurance Portability and Accountability Act in the United States [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Against
this backdrop, it is no real surprise that data extraction has been applied also
for laws enacted in the EU. While there is previous work for identifying
requirements from the GDPR manually [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], there also exist more systematic
data extraction approaches [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. However, neither domain has addressed the
enforcement of this EU-wide regulation. In fact, a reasonably comprehensive
literature search indicates no previous empirical research on the GDPR's
enforcement. Given this pronounced gap in the existing literature, this paper sets
out to examine the following two Questions (Q) regarding the enforcement fines:
      </p>
      <p>Q1: (i) Which GDPR articles have been most often referenced in the recent
enforcement cases, and (ii) do the enforcement fines vary across these articles?</p>
      <p>Q2: How well can the recent GDPR fines be predicted in terms of basic available
(i) meta-data and (ii) textual traits derived from the enforcement decisions?</p>
      <p>
        These two questions place the present work into the legal mining domain. The
underlying rationales are also transferable. For instance, an answer to Q1 helps
to understand which aspects of the GDPR have been actively enforced during
the early rollout of the regulation. Q2 also carries a practical motivation: by
knowing whether the penalties are predictable by machine learning techniques,
a starting point is available for providing further insights in different practical
scenarios. These scenarios range from the automated archival of enforcement
decisions and the designation of preventive measures to litigation preparations.
However, it is important to remark that the GDPR's enforcement is done by
national data protection authorities. Although the focus on public administration
is maintained nevertheless, documents about the enforcement decisions reached
by these authorities should not be strictly equated to law-like legal documents.
This point provides an impetus to move forward by elaborating the dataset used.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Data</title>
      <p>
        The dataset is based on a GDPR enforcement tracker that archives the fines
and penalties imposed by the European data protection authorities [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This
tracker is maintained by an international law firm for archiving many of the
known enforcement cases. Each case is accompanied by meta-data supplied by
the firm as well as a link to the corresponding decision from a national authority.
In addition to potentially missing cases due to the lack of publicly available
information, the archival material is unfortunately incomplete in many respects.
The reason originates from the incoherent reporting practices of the European
data protection authorities. Therefore, all cases were obtained from the tracker,
but the following four steps were followed to construct a sample for the analysis:
        <list list-type="order">
          <list-item>
            <p>To maintain coherence between Q1 and Q2, only those cases were included
that had both meta-data and links to the decisions available. In terms of the
former, some cases lacked meta-data about the fines imposed, the particular
GDPR articles referenced in the decisions, and even links to the decisions.</p>
          </list-item>
          <list-item>
            <p>To increase the quality of the sample, only those cases were included that
were accompanied by more or less formal documents supplied on the official
websites of the data protection authorities. Thus, those cases were excluded
whose archival material is based on online media articles, excerpts collected from
annual reports released by the authorities, and related informal sources.</p>
          </list-item>
          <list-item>
            <p>If two or more cases were referenced with the same decision, only one decision
document was included but the associated meta-data was unified into a single
case by merging the article references and totaling the fines imposed.</p>
          </list-item>
          <list-item>
            <p>All national decisions written in languages other than English were
translated to English with Google Translate. In general, such machine translation
is necessary due to the EU-wide focus of the forthcoming empirical analysis.</p>
          </list-item>
        </list>
      </p>
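      <p>The third step can be illustrated with a minimal Python sketch. The record layout (the keys decision_url, articles, and fine_eur) and the example values are hypothetical and are not taken from the tracker; the sketch only shows how cases sharing one decision document could be unified by merging the article references and totaling the fines.</p>
      <preformat><![CDATA[
```python
from collections import OrderedDict


def merge_cases(cases):
    """Unify cases that point to the same decision document."""
    merged = OrderedDict()
    for case in cases:
        key = case["decision_url"]
        if key not in merged:
            merged[key] = {"decision_url": key, "articles": set(), "fine_eur": 0}
        # Merge the article references and total the fines imposed.
        merged[key]["articles"] |= set(case["articles"])
        merged[key]["fine_eur"] += case["fine_eur"]
    return list(merged.values())


# Hypothetical example records (not real tracker data).
cases = [
    {"decision_url": "https://example.org/d1", "articles": ["A5", "A6"], "fine_eur": 1000},
    {"decision_url": "https://example.org/d1", "articles": ["A6", "A32"], "fine_eur": 2500},
    {"decision_url": "https://example.org/d2", "articles": ["A15"], "fine_eur": 400},
]
sample = merge_cases(cases)
```
]]></preformat>
      <p>In this toy example, the two records linked to the first decision collapse into one case that references A5, A6, and A32.</p>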
      <p>Given these restrictions, the sample amounts to about 72% of all cases
archived in the tracker at the time of data collection. Even with these
precautions, it should be stressed that the quality of the sample is hardly optimal.
While the accuracy of the meta-data supplied by the firm is taken for granted,
there are also some issues with the quality of the publicly available decisions.
The authorities in some countries (e.g., Hungary and Spain) have released highly
detailed and rigorous documents about their decisions, while some other
authorities (e.g., in Germany) have opted for short press releases. Although most of
the documents were supplied in the portable document format (PDF) and
informally signed by the authorities, it should thus be stressed that the data quality
is not consistent across the European countries observed. In addition, it is worth
remarking that scanned PDF documents (as used, e.g., in Portugal)
had to be excluded due to the automatic data processing. While these data
quality issues underline the paper's exploratory approach, they also carry political
and administrative ramifications that are briefly discussed later on in Section 6.</p>
    </sec>
    <sec id="sec-4">
      <title>Methods</title>
      <p>Descriptive statistics and regression analysis are used for answering the two
questions asked. In terms of Question Q1, the logarithm of the fines imposed is
simply regressed on dummy variables for the GDPR articles referenced
by using the conventional analysis-of-variance (ANOVA). As many of the cases
reference multiple articles, it should be remarked that these dummy variables
are not so-called fixed effects. The methods for answering the second
Question Q2 require a more thorough elaboration. In addition to (i) the GDPR
articles, the meta-data aspects include dummy variables for the following features:
(ii) the year of a given enforcement case; (iii) the country in which the given
fine was imposed; and (iv) the sector of the violating organization. The last
feature was constructed manually by using five categories: individuals, public sector
(including associations), telecommunications, private sector (excluding
telecommunications), and unknown sector due to the lack of meta-data supplied in the
enforcement tracker. In total, these features amount to 49 dummy variables.</p>
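      <p>As a rough illustration of the meta-data features, the following Python sketch builds dummy variables from a list of cases. The field names and the example cases are hypothetical. Because a case may reference several articles, the article dummies form a multi-hot encoding, whereas year, country, and sector each activate exactly one column per case.</p>
      <preformat><![CDATA[
```python
def build_dummies(cases):
    """Return (column names, rows of 0/1 values) for the meta-data features."""
    columns = sorted(
        {f"article={a}" for c in cases for a in c["articles"]}
        | {f"year={c['year']}" for c in cases}
        | {f"country={c['country']}" for c in cases}
        | {f"sector={c['sector']}" for c in cases}
    )
    rows = []
    for c in cases:
        # Collect the columns that are switched on for this case.
        active = {f"article={a}" for a in c["articles"]}
        active |= {f"year={c['year']}", f"country={c['country']}", f"sector={c['sector']}"}
        rows.append([1 if col in active else 0 for col in columns])
    return columns, rows


# Hypothetical example cases (not real tracker data).
cases = [
    {"articles": ["A5", "A6"], "year": 2019, "country": "ES", "sector": "private"},
    {"articles": ["A32"], "year": 2020, "country": "HU", "sector": "public"},
]
columns, rows = build_dummies(cases)
```
]]></preformat>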
      <p>
        The textual aspects for Q2 are derived from the translated decisions. Seven
steps were used for pre-processing: (a) all translated decision documents were
lower-cased and (b) tokenized according to white space and punctuation
characters; (c) only alphabetical tokens recognized as English words were included;
(d) common and custom stopwords were excluded; (e) tokens with lengths less
than three characters or more than twenty characters were excluded; (f) all
tokens were lemmatized into their common English dictionary forms; and, finally,
(g) those lemmatized tokens were excluded that occurred in the whole
decision corpus fewer than three times. A common natural language processing
library [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] was used for this processing together with a common English
dictionary [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. In addition to the stopwords supplied in the library, the twelve most
frequent tokens were used as custom excluded stopwords: data, article, personal,
protection, processing, company, authority, regulation, information, case, art, and
page. After this pre-processing, the token-based term frequency (TF) and term
frequency inverse document frequency (TF-IDF) were calculated from the whole
corpus constructed (for the exact formulas used see, e.g., [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]). These common
information retrieval statistics are used for evaluating the second part of Q2. In
general, TF-IDF is often preferred as it penalizes frequently occurring terms.
      </p>
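      <p>The pre-processing and weighting can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the pipeline used for the paper: the dictionary check of step (c) and the lemmatization of step (f) are omitted because they require external resources, the stopword list is an abbreviated stand-in, and TF-IDF is computed with one common variant (raw count times the logarithm of the inverse document frequency), which may differ from the exact formulas in the cited reference.</p>
      <preformat><![CDATA[
```python
import math
import re
from collections import Counter

# Abbreviated, illustrative stopword list (assumption): a few common English
# stopwords plus some of the custom ones listed in the text.
STOPWORDS = {"the", "and", "for", "was", "with", "data", "article", "personal"}


def tokenize(text):
    """Steps (a), (b), (d), (e): lower-case, split, drop stopwords and odd lengths."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and 3 <= len(t) <= 20]


def build_tf_idf(documents, min_count=3):
    """Return (vocabulary, TF rows, TF-IDF rows) for a list of raw texts."""
    docs = [tokenize(d) for d in documents]
    corpus_counts = Counter(t for doc in docs for t in doc)
    # Step (g): keep tokens occurring at least min_count times in the corpus.
    vocab = sorted(t for t, n in corpus_counts.items() if n >= min_count)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(t for doc in docs for t in set(doc))
    n_docs = len(docs)
    tf_rows, tfidf_rows = [], []
    for doc in docs:
        counts = Counter(doc)
        tf_rows.append([counts[t] for t in vocab])
        tfidf_rows.append(
            [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab]
        )
    return vocab, tf_rows, tfidf_rows


# Hypothetical miniature corpus.
documents = [
    "The controller processed personal data without consent. Consent was missing.",
    "A fine was imposed on the controller for missing consent.",
    "The controller ignored the access request.",
]
vocab, tf_rows, tfidf_rows = build_tf_idf(documents)
```
]]></preformat>
      <p>In this toy corpus, the token "controller" appears in every document, so its TF-IDF weight is zero in all rows, whereas its TF counts are not; this is the penalty on frequently occurring terms noted above.</p>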
      <p>
        Sparsity is the biggest issue for prediction. There are only 154 observations,
but the meta-data alone amounts to 49 independent variables, and the TF
and TF-IDF each to 4189 independent variables. Fortunately, the problem is not
uncommon, and well-known solutions exist for addressing it. Genomics is a good
example of an application domain riddled with the problem; within this
domain, it is not uncommon to operate with datasets containing a few thousand
observations and tens of thousands of predictors [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Dimension reduction is the
generic solution in this domain and other domains with similar problems. Thus,
three common dimension reduction methods for regression analysis are used:
principal component regression (PCR), partial least squares (PLS), and ridge
regression (for a concise overview of these methods see, e.g., [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]). In essence,
PCR uses uncorrelated linear combinations as the independent variables; PLS
is otherwise similar but the dependent variable is also used for constructing the
combinations. Ridge regression is based on a different principle: the
dimensionality is reduced by shrinking some of the regression coefficients to zero. In general,
all three methods are known to yield relatively similar results in applied work.
      </p>
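      <p>To make the shrinkage idea concrete, the sketch below implements plain ridge regression in closed form with pure Python. It is an assumption-laden illustration rather than the estimator used for the paper: the paper relies on R packages, the data here are toy values, and plain ridge only shrinks coefficients toward zero without the variable selection that a package-specific estimator may add. PCR and PLS are not shown.</p>
      <preformat><![CDATA[
```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge(X, y, lam):
    """Ridge coefficients for centered data: (X'X + lam I)^-1 X'y."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    xtx = [
        [sum(Xc[i][a] * Xc[i][b] for i in range(n)) + (lam if a == b else 0.0)
         for b in range(p)]
        for a in range(p)
    ]
    xty = [sum(Xc[i][a] * yc[i] for i in range(n)) for a in range(p)]
    return solve(xtx, xty)


# Toy data: two dummy variables and hypothetical log-transformed fines.
X = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 0]]
y = [8.0, 10.0, 7.5, 6.0, 10.5, 6.5]
beta_weak = ridge(X, y, 0.1)     # small penalty
beta_strong = ridge(X, y, 10.0)  # large penalty, stronger shrinkage
```
]]></preformat>
      <p>Increasing the penalty parameter reduces the overall size of the coefficient vector; in practice the parameter would be chosen by cross-validation, as described in the next paragraph.</p>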
      <p>
        In terms of practical computation, the number of components for the PCR
and PLS models, and the shrinkage parameter for the ridge regression, are
optimized during the training, while the results are reported with respect to a test set
containing 20% of the enforcement cases. Centering (but not scaling) is used prior
to the training with a 5-fold cross-validation. Computation is carried out with
the caret package [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] in conjunction with the pls [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and foba [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] packages.
Although root-mean-square errors (RMSEs) are used for optimizing the
training, the results are summarized with mean absolute errors (MAEs) due to their
straightforward interpretability. These are defined as the arithmetic means of
the absolute differences between the observed and predicted fines in the test set.
      </p>
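      <p>The two error measures can be written out in a few lines of Python. The observed and predicted values below are hypothetical log-transformed fines chosen only for illustration.</p>
      <preformat><![CDATA[
```python
import math


def mae(observed, predicted):
    """Mean absolute error: mean of |observed - predicted|."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root-mean-square error: square root of the mean squared difference."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Hypothetical log-transformed fines for a small test set.
observed = [6.0, 8.5, 12.0, 15.4]
predicted = [7.0, 8.0, 11.0, 12.4]
test_mae = mae(observed, predicted)
test_rmse = rmse(observed, predicted)
```
]]></preformat>
      <p>Because the RMSE squares the differences, the single large miss in this toy example weighs more heavily in the RMSE than in the MAE. Note also that an absolute error of 1.4 on the natural-log scale corresponds to a multiplicative factor of about exp(1.4) ≈ 4.1 on the euro scale.</p>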
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>The GDPR fines imposed vary greatly. As can be seen from Fig. 1, a range
from about e<sup>6</sup> euros to e<sup>12</sup> euros captures the majority of the enforcement fines
observed. This range runs roughly from about four hundred to 163
thousand euros. That said, the distribution has a fairly long tail; a few large,
multi-million euro fines are also present in the sample. Therefore, the sample cannot
be considered biased even though the restrictions discussed in Section 3 exclude
some of the largest enforcement cases, including the announcements about the
intention to fine British Airways and Marriott International by the
Information Commissioner's Office in the United Kingdom. Although these two excluded
cases are, at least at the time of writing, preliminary announcements, they are
still illuminating in the sense that both were about large-scale data breaches.
</p>
      <p>[Fig. 1. Horizontal axis: Fines (euros, logarithm), with ticks from 4 to 18.]</p>
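      <p>The conversion between the logarithmic axis of Fig. 1 and euros can be checked with a short Python snippet. It assumes that the logarithm is the natural logarithm, which is consistent with the euro amounts quoted in the text; the function name is illustrative.</p>
      <preformat><![CDATA[
```python
import math


def log_to_euros(log_fine):
    """Convert a natural-log fine back to euros."""
    return math.exp(log_fine)


for log_fine in (6, 12):
    print(f"e^{log_fine} is about {log_to_euros(log_fine):,.0f} euros")
```
]]></preformat>
      <p>The two endpoints evaluate to roughly 403 and 162,755 euros, matching the rounded figures of four hundred and 163 thousand euros.</p>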
      <p>However, the GDPR's corresponding A32 for information security has not
been the most frequently referenced article in the recent enforcement cases.
Instead, A5 and A6, which address the general principles and lawfulness of personal
data processing, have clearly been the most referenced individual articles, as can
be seen from Fig. 2. These two articles account for as much as 87% of all 252
references made in the 154 enforcement cases. More than six references have
been made to A13 (informing obligations to data subjects), A15 (right to
access), A21 (right to object), and A17 (right to erasure). These references indicate
that enforcement has been active also with respect to the rights granted by the
GDPR for individual data subjects. Furthermore, less frequent references have
been made in the decisions to numerous other articles. These include the
obligations to designate data protection officers (A37), conduct impact assessments
(A35), and consult supervisory authorities (A36), to name three examples. While
the principles, lawfulness, and information security account for the majority, the
less frequent but still visible references to more specific articles hint that the
regulation's whole scope is slowly being enforced by the European authorities.</p>
      <p>[Fig. 2. Article labels on the axis: A5, A6, A32, A13, A15, A21, A17, A12, A33, A58, A9, A14, A25, A7.]</p>
      <p>Turning to the second part of Q1, the regression coefficients from the
log-linear ANOVA model are visualized in Fig. 3 (the intercept is present in the
model but not shown in the figure, and A36 is omitted as the single reference
made to the article coincides with the single reference made to A35 in the same
decision; the dummy variable for A35 thus captures the effect of both articles).
As can be seen, the confidence intervals (CIs) are quite wide for the articles
referenced only infrequently, and only six coefficients are statistically significant
at the conventional threshold. Thus, some care is required for interpretation.</p>
      <p>When looking at the coefficients with relatively tight CIs, it is evident that
variation is present but the magnitude of this variation is not substantial. Most of
the coefficients remain in the range [−5, 5]. However, together all the references
do yield a decent model; an F-test is statistically significant and the coefficient
of determination is large (R<sup>2</sup> ≈ 0.44). Putting aside the statistical insignificance,
it is also interesting to observe that some of the coefficients have negative signs,
meaning that some references indicate smaller fines compared to the average.
Among these are the conditions for consent (A7), sensitive data (A9),
transparency (A12), and informing (A13), as well as the already noted right to access
(A15), proper notifications about data breaches (A33), and the powers granted
for the supervisory authorities (A58). Finally, the coefficient
(1.52) for the information security article (A32) is statistically significant but does not stand
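out in terms of magnitude. When compared to cases without a reference to this
article, the coefficient of a log-linear model implies fines that are higher by a factor of about exp(1.52) ≈ 4.6 in cases referencing A32, with the other referenced articles held equal.</p>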
      <p>[Fig. 3. Regression coefficients by referenced article (A5, A6, A7, A9, A12, A13, A14, A15, A17, A18, A21, A25, A28, A31, A32, A33, A35, A37, A58, A83; A36 omitted). Vertical axis: Coefficient, with ticks from -5 to 10. Legend: p ≥ 0.05, p &lt; 0.05.]</p>
      <p>[Fig. 4. Two panels. Vertical axis: MAE, with ticks from 0.0 to 2.0.]</p>
      <p>The results regarding Q2 are summarized in Fig. 4 (the MAEs for the training
refer to the best cross-validated models). Three noteworthy observations can
be drawn from this summary. First and foremost, the prediction performance
is generally decent: the best-performing cases all yield MAEs roughly between
1.3 and 1.5 for the log-transformed fines. These average prediction errors also seem
reasonable when taking a closer look at the actual predictions, except for
the outlying large fines. Take Fig. 5 as a brief example; the figure displays the
observed fines and the predicted fines based on the PLS and ridge regression
estimators for the first meta-data model. Even though most of the predicted
values are fairly close to the observed fines, the test set also contains one
five million euro fine that is quite severely underestimated by both regression
estimators. The underestimations amount to over 246 thousand euros. Though,
when a magnitude is measured in millions, it is a matter of interpretation whether
an error measured in hundreds of thousands is large, small, or something else.</p>
      <sec id="sec-5-1">
        <title>PLS, Model 1.</title>
      </sec>
      <sec id="sec-5-2">
        <title>Ridge, Model 1.</title>
        <p>Second, there are some interesting differences between the regression
estimators. In particular, PLS and ridge regression exhibit relatively large differences
between training and testing. The explanation relates to the RMSE-based
optimization during training. For instance, PCR was estimated with only one
component for the first meta-data model and three components for the remaining
three models, whereas two components were picked for all four PLS models.</p>
        <p>Last but not least, the smallest MAE for the test set is produced by ridge
regression using only the 49 meta-data variables. The second and third models
containing the TF and TF-IDF variables both perform worse. Furthermore, the
fourth model, which contains the meta-data and TF-IDF variables, indicates that
the text mining features tend to slightly weaken the predictions. It is also worth
remarking that some redundancy is present among the meta-data variables;
comparable performance is obtained with only 17 meta-data variables that are left
after prior pre-processing with caret's nearZeroVar function. All this said,
the overall interpretation should be less explicit when considering the practical
motivation for Q2 noted in Section 2. If only the decision documents are
available without any prior work to manually construct the meta-data from these,
even the simple text mining features could be used for black-box predictions.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>This paper explored two questions. The answers to these can be summarized as
follows. First, regarding Q1, the articles related to the general principles (A5),
lawfulness (A6), and information security (A32) have been most frequently
referenced by the national data protection authorities during the early enforcement
period observed in this paper. Although the enforcement fines also vary across
the various GDPR articles referenced in the authorities' decisions, the effects of
these three articles do not stand out in particular. A good corollary question
for further work would be to examine the future evolution of these references; a
hypothesis is that the regulation's enforcement is slowly moving from the
principles and lawfulness conditions to more specific elements. Second, regarding Q2,
it is possible to obtain decent predictions even with standard machine
learning techniques for regression analysis. Basic meta-data (i.e., articles referenced,
year of enforcement, country of origin, and industry sector) seems to provide
slightly better predictive performance compared to basic text mining features
(i.e., TF and TF-IDF) extracted from the decision documents. Yet, even the
text mining features seem sufficient for blind black-box predictions. There are
also many potential ways to improve the predictions reported, including those
related to regression analysis (such as using specific sparse-PLS estimators) and
text mining (such as using word embeddings). Data mining techniques (such as
topic modeling) could also be used for better understanding the nuances behind
the decisions. An alternative path forward would be to extend the specific data
extraction approaches discussed in Section 2 to the enforcement decisions.
However, the motivation to move forward is undermined by practical problems. As
was remarked in Section 3, the quality of data is already a problem of its own.</p>
      <p>
        Recently, the enforcement of the GDPR has been fiercely criticized by some
public authorities and pundits alike. The reasons are many: a lack of
transparency and cooperation between national data protection authorities,
diverging legal interpretations, cultural conflicts, the so-called "one-stop-shop" system,
old-fashioned information systems and poor data exchange practices, and so on
and so forth [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. The data collection used for the present work testifies on behalf
of the criticism: the decision documents released by the national authorities have
varied wildly in terms of quality and rigor. Some national authorities have even
hidden their decisions from public scrutiny. A paradox is present: although A15
grants a right for data subjects to access their personal data, the same subjects
may need to exercise their separate freedom of information rights to obtain cues
about decisions reached by national authorities. Four legs good, two legs bad.
      </p>
      <p>
        Finally, it is necessary to briefly point out the bigger issues affecting the legal
mining and data extraction domains and, therefore, also the present work. For
one thing, the practical usefulness of legal expert systems has been questioned
for a long time. The artificial intelligence hype has not silenced the criticism [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
Like with the "code is law" notion, which has never existed in reality [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], there
are also many philosophical counterarguments against the legal mining and data
extraction domains [
        <xref ref-type="bibr" rid="ref21 ref8 ref9">8, 9, 21</xref>
        ]. It is problematic at best to codify the methodology
of a scholarly discipline into rigid schemas in order to nurse the methodological
requirements of another discipline; legal reasoning is distinct from other types
of reasoning exercised in empirical sciences; and so forth. Law is not code. But
code is increasingly used to predict law enforcement decisions. The legal mining
domain, in particular, is frequently involved with a motivation to build "a
system that could predict judicial decisions automatically" but with a provision
that there is "no intention of creating a system that could replace judges" [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
Such system-building leads to another delicate paradox. Namely, the GDPR and
related laws (such as Directive 2016/680 for data protection in criminal matters)
were also designed to provide certain guards against legal mining and the
resulting automated decision-making involving human beings [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. This paper is not
immune to criticism originating from this fundamental paradox. If it is seen as
undesirable to build systems for making law enforcement decisions, it should
also be seen as undesirable to build systems for automatically fining companies.
      </p>
      <p>
        Acknowledgements. This research was supported by the Academy of Finland (grant number 327391).
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bennett</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raab</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>Revisiting the Governance of Privacy: Contemporary Policy Instruments in Global Perspective</article-title>
          .
          <source>Regulation &amp; Governance</source>
          (Published online in September) (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Breaux</surname>
            ,
            <given-names>T.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vail</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antón</surname>
            ,
            <given-names>A.I.</given-names>
          </string-name>
          :
          <article-title>Towards Regulatory Compliance: Extracting Rights and Obligations to Align Requirements with Regulations</article-title>
          .
          <source>In: Proceedings of the 14th IEEE International Requirements Engineering Conference (RE</source>
          <year>2006</year>
          ). pp.
          <volume>49</volume>
          –
          <fpage>58</fpage>
          . IEEE,
          <string-name>
            <surname>Minneapolis</surname>
          </string-name>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Calomme</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Why Open Legal Data and Analytics Are Not Without Risks (</article-title>
          <year>2020</year>
          ),
          <article-title>Centre for IT &amp; IP Law (CiTiP) Blog</article-title>
          , KU Leuven, available online in April: https://www.law.kuleuven.be/citip/blog/why-open-legal-data-and-analytics-are-not-without-risks/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Chhatwal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huber-Fliflet</surname>
          </string-name>
          , N.,
          <string-name>
            <surname>Keeling</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Empirical Evaluations of Active Learning Strategies in Legal Document Review</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Big Data (Big Data</source>
          <year>2017</year>
          ). pp.
          <volume>1428</volume>
          –
          <fpage>1437</fpage>
          . IEEE, Boston (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>CMS</given-names>
            <surname>Law.Tax: GDPR Enforcement Tracker</surname>
          </string-name>
          (
          <year>2020</year>
          ),
          <article-title>Data obtained on 24 February from</article-title>
          : https://enforcementtracker.com/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Colombani</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croiseau</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fritz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guillaume</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Legarra</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ducrocq</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robert-Granié</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A Comparison of Partial Least Squares (PLS) and Sparse PLS Regressions in Genomic Selection in French Dairy Cattle</article-title>
          .
          <source>Journal of Dairy Science</source>
          <volume>95</volume>
          (
          <issue>4</issue>
          ),
          <volume>2120</volume>
          –
          <fpage>2131</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Custers</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dechesne</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sears</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tani</surname>
          </string-name>
          , T., van der Hof, S.:
          <article-title>A Comparison of Data Protection Legislation and Policies Across the EU</article-title>
          .
          <source>Computer Law &amp; Security Review</source>
          <volume>34</volume>
          (
          <issue>2</issue>
          ),
          <volume>234</volume>
          –
          <fpage>243</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Dyevre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wijtvliet</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lampach</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>The Future of European Legal Scholarship: Empirical Jurisprudence</article-title>
          .
          <source>Maastricht Journal of European and Comparative Law</source>
          <volume>26</volume>
          (
          <issue>3</issue>
          ),
          <volume>348</volume>
          –
          <fpage>371</fpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          : Discussion Paper:
          <article-title>How Much of Commonsense and Legal Reasoning is Formalizable? A Review of Conceptual Obstacles</article-title>
          .
          <source>Law, Probability and Risk</source>
          <volume>11</volume>
          (
          <issue>2</issue>
          –3),
          <volume>225</volume>
          –
          <fpage>245</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Fuster</surname>
            ,
            <given-names>G.G.</given-names>
          </string-name>
          :
          <article-title>The Emergence of Personal Data Protection as a Fundamental Right of the EU</article-title>
          . Springer, Cham (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <source>The Elements of Statistical Learning: Data Mining, Inference, and Prediction</source>
          . Springer, New York (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Hausladen</surname>
            ,
            <given-names>C.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schubert</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ash</surname>
          </string-name>
          , E.:
          <article-title>Text Classification of Ideological Direction in Judicial Opinions</article-title>
          .
          <source>International Review of Law and Economics</source>
          <volume>62</volume>
          ,
          <issue>105903</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Hjerppe</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruohonen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Leppanen, V.:
          <article-title>The General Data Protection Regulation: Requirements, Architectures, and Constraints</article-title>
          .
          <source>In: Proceedings of the 27th IEEE International Requirements Engineering Conference (RE</source>
          <year>2019</year>
          ). pp.
          <volume>265</volume>
          –
          <fpage>275</fpage>
          . IEEE,
          <string-name>
            <surname>Jeju Island</surname>
          </string-name>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Kuhn</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>caret: Classification and Regression Training</article-title>
          (
          <year>2020</year>
          ),
          R package version 6.0-85, available online in February: https://cran.r-project.org/web/packages/caret/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Leith</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>The Rise and Fall of the Legal Expert System</article-title>
          .
          <source>International Review of Law, Computers &amp; Technology</source>
          <volume>30</volume>
          (
          <issue>3</issue>
          ),
          <volume>94</volume>
          –
          <fpage>106</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
          </string-name>
          , H.:
          <article-title>A Predictive Performance Comparison of Machine Learning Models for Judicial Cases</article-title>
          .
          <source>In: Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI</source>
          <year>2017</year>
          ). pp.
          <volume>1</volume>
          –
          <issue>6</issue>
          . IEEE,
          <string-name>
            <surname>Honolulu</surname>
          </string-name>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Medvedeva</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vols</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wieling</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Using Machine Learning to Predict Decisions of the European Court of Human Rights</article-title>
          .
          <source>Artificial Intelligence and Law</source>
          (Published online in June)
          ,
          <volume>1</volume>
          –
          <fpage>30</fpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Mevik</surname>
            ,
            <given-names>B.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wehrens</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>The pls Package: Principal Component and Partial Least Squares Regression in R</article-title>
          .
          <source>Journal of Statistical Software</source>
          <volume>18</volume>
          (
          <issue>2</issue>
          ),
          <volume>1</volume>
          –
          <fpage>23</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Mueller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Badiei</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Requiem for a Dream: On Advancing Human Rights via Internet Architecture</article-title>
          .
          <source>Policy and Internet</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <volume>61</volume>
          –
          <fpage>83</fpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Németh</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendricks</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McNamara</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , et al.:
          <string-name>
            <surname>Hunspell</surname>
          </string-name>
          (
          <year>2020</year>
          ),
          Version 1.7.0, available online in February: https://github.com/hunspell/hunspell
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Nissan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Computer Tools and Techniques for Lawyers and the Judiciary</article-title>
          .
          <source>Cybernetics and Systems</source>
          <volume>49</volume>
          (
          <issue>4</issue>
          ),
          <volume>201</volume>
          –
          <fpage>233</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <source>The Natural Language Toolkit (NLTK): Version 3.4.5</source>
          (
          <year>2019</year>
          ), available online in January 2020: http://www.nltk.org
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Ruohonen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Leppanen, V.:
          <article-title>Toward Validation of Textual Information Retrieval Techniques for Software Weaknesses</article-title>
          . In: Elloumi,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Granitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Hameurlain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Seifert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Tjoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.M.</given-names>
            ,
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <surname>R</surname>
          </string-name>
          . (eds.)
          <source>Proceedings of the 29th International Conference on Database and Expert Systems Applications (DEXA</source>
          <year>2018</year>
          ),
          <source>Communications in Computer and Information Science (Volume 903)</source>
          . pp.
          <volume>265</volume>
          –
          <fpage>277</fpage>
          . Springer, Regensburg (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Sleimi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ceci</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sannier</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sabetzadeh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Briand</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dann</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A Query System for Extracting Requirements-Related Information from Legal Texts</article-title>
          .
          <source>In: Proceedings of the IEEE 27th International Requirements Engineering Conference (RE</source>
          <year>2019</year>
          ). pp.
          <volume>319</volume>
          –
          <fpage>329</fpage>
          . IEEE,
          <string-name>
            <surname>Jeju Island</surname>
          </string-name>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Tamburri</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          :
          <article-title>Design Principles for the General Data Protection Regulation (GDPR): A Formal Concept Analysis and Its Evaluation</article-title>
          .
          <source>Information Systems</source>
          <volume>91</volume>
          ,
          <issue>101469</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>van Dijk</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tanas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rommetveit</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raab</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Right Engineering? The Redesign of Privacy and Personal Data Protection</article-title>
          .
          <source>International Review of Law, Computers &amp; Technology</source>
          <volume>32</volume>
          (
          <issue>2</issue>
          –3),
          <volume>230</volume>
          –
          <fpage>256</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Vinocur</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>'We Have a Huge Problem': European Tech Regulator Despairs Over Lack of Enforcement: The World's Toughest Privacy Law Proves Toothless in the Eyes of Many Critics</article-title>
          (
          <year>2019</year>
          ), Politico. Available online in February 2020: https://www.politico.com/news/2019/12/27/europe-gdpr-technology-regulation-089605
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Wagh</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anand</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Legal Document Similarity: A Multi-Criteria Decision-Making Perspective</article-title>
          .
          <source>PeerJ Computer Science</source>
          <volume>6</volume>
          ,
          <issue>e262</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Završnik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Criminal Justice, Artificial Intelligence Systems, and Human Rights</article-title>
          .
          <source>ERA Forum</source>
          <volume>20</volume>
          ,
          <issue>567</issue>
          –
          <fpage>583</fpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , T.:
          <article-title>foba: Greedy Variable Selection (</article-title>
          <year>2008</year>
          ),
          R package version 0.1, available online in February: https://cran.r-project.org/web/packages/foba/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>