<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RegMiner: Taming the Complexity of Regulatory Documents for Digitalized Compliance Management</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Karolin Winter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Gall</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefanie Rinderle-Ma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Group Workflow Systems and Technology, Faculty of Computer Science</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Business process compliance has become crucial for companies because severe fines can be imposed if constraints and rules emerging from regulatory documents are violated. Regulatory documents are often written in natural language, and analyzing them is mainly done manually since only limited tool support is available. Therefore, we present RegMiner, a web service for discovering and visualizing constraints from regulatory documents. By employing NLP and data mining techniques, compliance constraints can be automatically extracted, grouped, and visualized, leading to a separation of relevant and non-relevant document parts and to insights into, e.g., duties of stakeholders. A case study based on a current document from the European Parliament regarding the financial domain demonstrates RegMiner's maturity.</p>
      </abstract>
      <kwd-group>
        <kwd>Business Process Compliance</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Regulatory Documents</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction and Significance for the BPM Field</title>
      <p>
        Analyzing regulatory documents and assessing the compliance of business
processes with regulations poses a tremendous challenge for companies and is
still mostly done manually, although the number of regulatory documents is
constantly increasing. Compliance violations, in turn, can have severe consequences,
e.g., fines of more than 20 million euros for violating the General Data Protection
Regulation (GDPR). Hence, providing support for digitalized compliance
management is crucial. Yet, only few tools for extracting, analyzing and visualizing
compliance constraints have been proposed. Most tools rather focus on a
unidirectional or bidirectional mapping of natural language texts to imperative or declarative
process models [
        <xref ref-type="bibr" rid="ref2 ref3 ref5">2,3,5</xref>
        ]. Hence, we propose RegMiner, a web service for
extracting, processing and visualizing constraints from regulatory documents, based on
previous publications [
        <xref ref-type="bibr" rid="ref7 ref9">7,9</xref>
        ]. Section 2 describes the innovations and architecture
of RegMiner. Section 3 elaborates on a case study based on a recent document on
macro-financial assistance provided by the European Parliament and Council and
gives an outlook on future work. Potential users of RegMiner are legal and
compliance experts as well as researchers.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Innovation and Architecture</title>
      <p>RegMiner can be accessed via http://regminer.wst.univie.ac.at/</p>
      <p>A tutorial as well as example data sets are available at http://gruppe.wst.univie.ac.at/projects/RegMiner/index.php?t=prototypes</p>
      <p>A screencast is available at http://gruppe.wst.univie.ac.at/~gallm6/RegMiner/Video/</p>
      <p>
        RegMiner is based on the command line prototypes of previous publications
[
        <xref ref-type="bibr" rid="ref7 ref9">7,9</xref>
        ] and consists of a three-tier architecture as depicted in Fig. 1. By offering
RegMiner as a web service, we aim to provide a low-threshold entry point for
non-technical experts to analyze regulatory documents in an automated way.
Technical experts and researchers also benefit, since RegMiner can be tested quickly
without installing any libraries. RegMiner's visualization allows for
easily separating relevant from non-relevant document parts as well as gaining
insights into, e.g., duties of stakeholders in an aggregated form.
      </p>
      <sec id="sec-2-1">
        <title>Presentation Tier</title>
        <p>The presentation tier consists of a single page application
based on HTML markup and JavaScript components. In particular, we use the
Bootstrap toolkit<sup>2</sup> and d3.js<sup>3</sup>. The following information must be provided
by the user.</p>
        <p>Fig. 1: RegMiner – Architecture. Presentation tier: receive user input (document, language, signal words, grouping choice), pass user input to logic tier, display results received from logic tier. Logic tier: NLP component (extract constraints, group constraints, calculate constraint graph). Data tier: Document DAO (ID, name, [paragraphIDs]); Paragraph DAO (ID, name, [sentences]); Configuration DAO (ID, grouping option, language, signalwordsID, userdefinedtermsID); File DAO (ID, filename, content, where content can be a list of keywords or user-defined terms); Graph DAO (ID, documentID, graph as JSON).</p>
        <p>Document. The input can be provided in two ways. The first option is a
ZIP file, which can contain a) one document, b) a set of documents, or
c) a set of paragraphs.</p>
        <p>
          Option c) represents a partitioning of one or several
documents, e.g., based on the document structure into its sections (cf. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]). Although
this is not mandatory, the partitioning into paragraphs facilitates the visual
inspection of results afterwards. In addition, it allows the user to restrict
the analysis to a selection of document parts.
        </p>
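        <p>As a rough illustration of the ZIP option, the following Python sketch reads the entries of an archive into named paragraphs. It assumes that every relevant entry is a UTF-8 encoded .txt file and that the entry name serves as the paragraph name; the function name read_paragraphs, the file names and the sample sentences are hypothetical and not taken from RegMiner.</p>
        <preformat><![CDATA[
```python
import io
import zipfile


def read_paragraphs(zip_bytes):
    """Map each .txt entry of a ZIP archive to its text content."""
    paragraphs = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in sorted(archive.namelist()):
            # Assumption: only plain-text entries are treated as paragraphs.
            if name.endswith(".txt"):
                paragraphs[name] = archive.read(name).decode("utf-8")
    return paragraphs


# Build a small archive in memory to stand in for an uploaded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("article_1.txt", "The Commission shall report annually.")
    archive.writestr("article_2.txt", "The partner should provide information.")

paragraphs = read_paragraphs(buffer.getvalue())
```
]]></preformat>
        <p>Keeping the entry name next to the text is what would later allow a constraint to be traced back to its paragraph.</p>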
        <sec id="sec-2-1-1">
          <p>2 https://getbootstrap.com/ 3 https://d3js.org/</p>
          <p>As a second option, we decided to integrate support for the EUR-Lex
platform<sup>4</sup>, which contains an extensive collection of legal documents such as the
General Data Protection Regulation, affecting stakeholders of various domains.
The user can provide a URL referencing the HTML markup of the desired
EUR-Lex document. The document is downloaded and automatically split into
sections based on HTML tags and attributes.</p>
          <p>
            Language. English and German are supported as document languages.
Signal Words. At least one signal word to identify constraints, e.g., "shall",
"should" or "must" (cf. [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]), has to be selected.
          </p>
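          <p>The role of signal words can be sketched in a few lines of Python. This is a simplified illustration, assuming that a sentence counts as a constraint candidate as soon as it contains one of the selected signal words as a whole word; the function name find_constraints and the sample sentences are made up for this example, and the actual NLP component may apply further linguistic checks.</p>
          <preformat><![CDATA[
```python
import re


def find_constraints(sentences, signal_words):
    """Return the sentences that contain at least one signal word."""
    if not signal_words:
        raise ValueError("at least one signal word has to be selected")
    # Whole-word, case-insensitive match, so "must" does not hit "mustard".
    alternatives = "|".join(re.escape(word) for word in signal_words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return [sentence for sentence in sentences if pattern.search(sentence)]


sentences = [
    "The Commission shall inform the European Parliament.",
    "This Decision enters into force on the day of its publication.",
    "Partners should respect effective democratic mechanisms.",
    "The recipe calls for mustard.",
]
constraints = find_constraints(sentences, ["shall", "should", "must"])
```
]]></preformat>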
          <p>Grouping Options. Three options for grouping constraints are available:</p>
          <p>Clustering. Constraints are clustered based on term frequency using k-means++;
in this case the user needs to specify the number of clusters.</p>
          <p>Subject determined by Sentence Structure. The subject of each constraint is
identified based on NLP tags determined by an NLP parser. One group is created
per subject.<sup>5</sup> This option is fully automatic, i.e., it requires no further user input.</p>
          <p>User-defined Terms. Constraints are grouped based on terms defined by the user,
which need to be uploaded as a .txt file. Each term defines one group, e.g., if
"authority" and "user" are contained in the .txt file, all constraints
containing "authority" are assigned to one group and all constraints containing "user"
are assigned to another group. If neither of them is present, the constraint
is assigned to the group "undefined"; if both of them are present, the constraint
is contained in the "authority" as well as the "user" group.</p>
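          <p>The grouping by user-defined terms described above can be sketched as follows. The sketch assumes case-insensitive substring matching, which the text does not specify; the function name group_by_terms and the sample constraints are hypothetical.</p>
          <preformat><![CDATA[
```python
from collections import defaultdict


def group_by_terms(constraints, terms):
    """Assign each constraint to the group of every term it contains."""
    groups = defaultdict(list)
    for constraint in constraints:
        lowered = constraint.lower()
        matched = [term for term in terms if term.lower() in lowered]
        # No term present: fall back to the "undefined" group.
        for term in matched or ["undefined"]:
            groups[term].append(constraint)
    return dict(groups)


terms = ["authority", "user"]  # e.g. the lines of the uploaded .txt file
constraints = [
    "The authority shall notify the user without delay.",
    "The user must keep a record.",
    "Records shall be stored for five years.",
]
groups = group_by_terms(constraints, terms)
```
]]></preformat>
          <p>The first sample constraint mentions both terms and therefore ends up in both groups, while the third one mentions neither and is collected under "undefined".</p>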
          <p>On submission, the information is passed to the logic tier. As soon as the
results are retrieved, a graph consisting of dots arranged in
clusters is displayed. Each dot represents one constraint and its color indicates the
corresponding cluster. Hovering over a dot shows the constraint;
double-clicking a dot displays the paragraph containing the constraint.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Logic Tier</title>
        <p>
          The logic tier is written in Python 3 and consists of two components. The first
one, based on the web application framework Flask<sup>6</sup>, handles the communication
with the presentation tier, i.e., it delivers the single page application to the
client's browser, handles user input and returns constraint graphs in the JSON
format. The second component, called NLP
component, conducts the actual constraint extraction and grouping procedure. It
is based on the NLP framework spaCy [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] for linguistic analysis and scikit-learn
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for cluster analysis. The choice of spaCy rests on three pillars: its accuracy
in determining NLP tags, its built-in similarity function relying on pre-trained
word vectors, and its speed (cf. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] for a detailed comparison of NLP parsers). The
information retrieved from the presentation tier as well as the results returned
by the NLP component are passed to the data tier.
        </p>
        <p>
4 https://eur-lex.europa.eu/homepage.html?locale=en
5 For further details see [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
6 https://palletsprojects.com/p/flask/
        </p>
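        <p>To make the JSON hand-over between logic and presentation tier more tangible, the sketch below serializes grouped constraints into a flat node list, one node per constraint. The field names (id, group, paragraph, text), the function name build_graph and the sample constraints are assumptions made for this illustration; the format actually used by RegMiner is not specified here.</p>
        <preformat><![CDATA[
```python
import json


def build_graph(groups):
    """Flatten {group: [(paragraph, constraint), ...]} into a node list."""
    nodes = []
    for group, members in sorted(groups.items()):
        for paragraph, text in members:
            nodes.append({
                "id": len(nodes),        # running node index
                "group": group,          # would determine the dot color
                "paragraph": paragraph,  # would be shown on double-click
                "text": text,            # would be shown on hover
            })
    return {"nodes": nodes}


# Made-up sample data, not taken from a real document.
groups = {
    "Commission": [("Article 3", "The Commission shall agree on conditions.")],
    "assistance": [("Article 1", "The assistance shall be made available.")],
}
payload = json.dumps(build_graph(groups))
```
]]></preformat>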
      </sec>
      <sec id="sec-2-3">
        <title>Data Tier</title>
        <p>For storing and retrieving data, RegMiner uses the MongoDB<sup>7</sup> database. Each
document uploaded by the user consists of a name and a list referencing the
corresponding paragraphs, which are stored in separate files. Each paragraph consists of
a name and content, i.e., the sentences within the paragraph. Signal words as
well as user-defined terms are stored in separate files. Furthermore, a
configuration file is created that holds the grouping choice, the language and a
reference to the signal words file and, where applicable, the user-defined terms file.
Information stored within the database is identifiable via a unique hash value.
This has two benefits.
First, if a document with the same hash value already exists, it is not stored
in the database again. Second, as the results of the NLP component
are also stored in the database, recomputation can be avoided. This enhances
usability, as results can be displayed immediately. The latter is important for
extensive documents, whose analysis can take up to several minutes.</p>
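        <p>The two benefits of hash-based identification can be illustrated with a small in-memory stand-in for the database. The sketch assumes SHA-256 and assumes that the hash covers the document text together with the configuration; neither detail is stated above, and the class ResultCache as well as the analyze function are hypothetical.</p>
        <preformat><![CDATA[
```python
import hashlib
import json


class ResultCache:
    """In-memory stand-in for the database, keyed by a content hash."""

    def __init__(self):
        self._results = {}
        self.computations = 0

    @staticmethod
    def key(document_text, config):
        # Assumption: the same text analyzed with other settings yields
        # another graph, so the configuration is part of the hashed payload.
        payload = json.dumps({"doc": document_text, "config": config},
                             sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, document_text, config, compute):
        digest = self.key(document_text, config)
        if digest not in self._results:
            self._results[digest] = compute(document_text, config)
            self.computations += 1
        return self._results[digest]


def analyze(text, config):
    """Placeholder for the NLP component: count signal word occurrences."""
    return {"constraints": sum(text.count(w) for w in config["signal_words"])}


cache = ResultCache()
config = {"language": "en", "grouping": "subject", "signal_words": ["shall"]}
first = cache.get_or_compute("The Commission shall report.", config, analyze)
second = cache.get_or_compute("The Commission shall report.", config, analyze)
```
]]></preformat>
        <p>The second call finds the stored result under the same hash and skips the analysis, which mirrors how repeated requests for an extensive document could be answered immediately.</p>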
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Maturity</title>
      <p>
        The underlying concepts of the NLP component have proven to yield
valuable insights into regulatory documents from various domains, such as security,
the medical domain and the financial domain, as well as generally applicable regulatory
documents like the GDPR [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7,8,9</xref>
        ]. In this paper, we demonstrate how to gain insights into regulatory documents
that have just come into effect. In such a situation, a user would not be able to carry out a
targeted search for specific terms without having read the document. RegMiner's
benefit is that a user need not have previous knowledge of a regulatory
document but can still gain insights into it quickly and with little effort.
For the case study, we selected the document on macro-financial assistance to enlargement
and neighbourhood partners in the context of the COVID-19 pandemic<sup>8</sup>.
RegMiner took around 8 seconds to discover the graph depicted in Fig. 2 using
"shall", "should" and "must" as signal words and sentence structure as
grouping option.
      </p>
      <p>Fig. 2: RegMiner – Result for Case Study on Macro-financial Assistance</p>
      <sec id="sec-3-1">
        <p>7 https://www.mongodb.com/ 8 https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32020D0701&amp;qid=1591613404669&amp;from=DE</p>
        <p>It can be recognized at first glance that the terms macro-financial
assistance and Commission form the largest groups (black and green). By
further examining the underlying constraints by hovering over the dots, the actions
and duties of the Commission become immediately apparent and can be easily
summarized. A user interested in the conditions under which macro-financial assistance
applies, or in its aims, can obtain this information directly from the
graph. If a user needs the context of a constraint, the corresponding
paragraph can be displayed by double-clicking the constraint. Thereby, a user
can directly retrieve the right position within the document. The ability to
retrieve such a graph out-of-the-box for an arbitrary and recent document demonstrates
the applicability of RegMiner. Finally, RegMiner is being tested and discussed with
stakeholders from the financial domain through the transfer project RegMiner<sup>9</sup>.</p>
        <p>As future work, we plan to improve the pre-processing of documents, provide
support for document formats such as PDF, integrate further visualization
concepts, especially for complex and extensive documents, and investigate
domain-specific language models as well as the integration of ontologies.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgment</title>
      <p>This work has been funded by the Vienna Science and Technology Fund (WWTF)
through project NXT19-003.</p>
      <p>9 https://www.wwtf.at/programmes/new_exciting_transfer_projects/NXT19-003</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Al Omran</surname>
            ,
            <given-names>F.N.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Treude</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Choosing an NLP library for analyzing software documentation: a systematic literature review and a series of experiments</article-title>
          .
          <source>In: Mining Software Repositories</source>
          . pp.
          <fpage>187</fpage>
          –
          <lpage>197</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Delicado Alcantara</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanchez Ferreres</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carmona Vargas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Padro</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>NLP4BPM: Natural language processing tools for business process management</article-title>
          .
          <source>In: BPM Demo and Industrial Track 2017 Proceedings</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>5</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Freytag</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allgaier</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>WoPeD goes NLP: Conversion between workflow nets and natural language</article-title>
          .
          <source>In: BPM (Dissertation/Demos/Industry)</source>
          . pp.
          <fpage>101</fpage>
          –
          <lpage>105</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Honnibal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montani</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing</article-title>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lopez</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debois</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hildebrandt</surname>
            ,
            <given-names>T.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marquard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The process highlighter: From texts to declarative processes and back</article-title>
          .
          <source>BPM (Dissertation/Demos/Industry) 2196</source>
          ,
          <fpage>66</fpage>
          –
          <lpage>70</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          –
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rinderle-Ma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Detecting constraints and their relations from regulatory documents using NLP techniques</article-title>
          .
          <source>In: CoopIS</source>
          . pp.
          <fpage>261</fpage>
          –
          <lpage>278</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rinderle-Ma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Untangling the GDPR using ConRelMiner</article-title>
          .
          <source>Tech. rep.</source>
          (
          <year>2018</year>
          ), http://arxiv.org/abs/1811.03399
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rinderle-Ma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grossmann</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feinerer</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Characterizing regulatory documents and guidelines based on text mining</article-title>
          .
          <source>In: CoopIS</source>
          . pp.
          <fpage>3</fpage>
          –
          <lpage>20</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>