CEUR Workshop Proceedings Vol-3033, paper 64: https://ceur-ws.org/Vol-3033/paper64.pdf
    An Obligations Extraction System for Heterogeneous Legal Documents:
                  Building and Evaluating Data and Model

        Maria Iacono, Laura Rossi, Paolo Dangelo, Andrea Tesei, Lorenzo De Mattei
                                   Aptus.AI / Pisa, Italy
                {maria,laura,paolo,andrea,lorenzo}@aptus.ai



                        Abstract

     A system that automatically extracts obligations from heterogeneous regulations could be of great help to a variety of stakeholders, including financial institutions. To reach this goal, we propose a methodology for building a training set of regulations written in Italian and drawn from different legal sources, together with a system based on a Transformer language model to solve the task. More importantly, we take a deep dive into the process of human and machine-learned annotation by carrying out both quantitative and manual evaluations of the two.

1    Introduction

Compliance practitioners in financial institutions are overburdened by the high volume of incoming regulations issued by different legal sources, such as the European Union, national legislatures, central banks and independent administrative authorities, to name a few. Part of the work of compliance offices consists of extracting obligations from this vast amount of regulations in order to trigger compliance processes, and doing so is tedious and repetitive work. In this scenario, systems that automate the process could substantially cut costs, and Machine Learning (ML) and Natural Language Processing (NLP) can help. However, given the variety of legal sources, training this kind of system is a complex activity because it requires a sufficient amount of annotated data, which is expensive to produce, especially if the annotation requires legal domain experts.

     Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

   Obligation extraction has already been studied with different approaches. Bartolini et al. (2004) used a shallow syntactic parser and hand-crafted rules to automatically classify law paragraphs according to their regulatory content and to extract text fragments corresponding to specific semantic roles. Similarly, Sleimi et al. (2018) automatically represent the semantics of legal texts with an RDF schema, using a system based on a dependency parser and hand-crafted rules; Sleimi et al. (2019) used the same representation to build a question-answering system with a focus on obligations. Biagioli et al. (2005) represent law paragraphs as bags of words with either TF or TF-IDF weighting (Salton and Buckley, 1988) and used Support Vector Machines (SVM) to classify each paragraph as a type of provision, including obligations. A similar approach is adopted by Francesconi and Passerini (2007): they classify the paragraphs of legislative texts according to their proposed provision model, represent them much as Biagioli et al. (2005) do, and use two learning algorithms, Naive Bayes and SVM. Sleimi et al. (2020) propose to address the complexity of regulatory texts by writing them according to a set of standard templates that can easily be parsed.

Contributions In this work we offer four main contributions. (i) We propose a methodology for building training corpora that relies on non-expert annotators, and we apply it to a set of heterogeneous regulations written in Italian and coming from different legal sources. (ii) We assess the quality of the introduced methodology through an inter-annotator agreement score and carry out an error analysis to highlight if and when expert annotators are required. (iii) We use the resulting dataset to train and test an obligation classification system based on neural networks, an approach that has been proven to provide state-of-the-art results on several Italian classification tasks (De Mattei et al., 2018; Cimino et al., 2018; Occhipinti et al., 2020). (iv) We conduct a manual error analysis to investigate the strengths and limitations of this system.

2    Task Description

The task we tackle consists of classifying regulation clauses as obligations or not. By obligation we mean, from a juridical point of view, a legal constraint imposed by law and addressed to a juridical person.
   Being interested in developing a system that supports financial institutions, we distinguish two categories of obligations, classifying them as relevant or irrelevant for financial institutions. Each clause can thus be assigned one of three labels: (i) not obligation, (ii) relevant obligation and (iii) not relevant obligation. This classification schema allows practitioners to retrieve in one click either all the obligations or only the relevant ones, so that they can choose between a complete overview of the laws they are consulting and a focus on the obligations that actually affect their institutions.
   To distinguish the two categories, we look at the subject to whom the obligation is addressed: if it is a public institution, we classify the obligation as irrelevant; in all other cases, as relevant. This simplification of the classification criterion may seem extreme, since it implies that any obligation not addressed to a public institution must be considered relevant for a financial institution. However, we believe applying this distinction is a good strategy because the documents we analyze are already filtered, i.e., they belong to a category of laws that impact financial institutions. Consequently, if an obligation within them is not directed at a public institution, it will almost certainly be directed, in some way, at financial institutions.

2.1    Special Cases

Legal jargon is not merely a tool for argumentation or narrative, but a constitutive element of the law. Consequently, the structure of legal texts has particular characteristics that must follow precise and predictable patterns. Despite this, there are cases in which the language can be ambiguous. Since our goal is to build a dataset in line with compliance practitioners' expectations, we analyzed some special cases with a group of experts in order to provide clear guidelines to the annotators.
   One such case is when an obligation is expressed indirectly, for example through the formulation of a right. If an article talks about rights of any kind, it assumes that those rights must be respected. So, for example, the right of a client to obtain a loan (the client's point of view) corresponds to a duty of the bank, which is obliged to grant the loan if the client qualifies for it (the bank's point of view). Similarly, an employee's right to go on vacation means that the employer must guarantee vacation days. For this reason, in deciding how to classify a part of a law, in addition to the annotator's interpretation, the concept of "priority" comes into play. Since our application is designed to support financial institutions, our priority is to highlight the obligations that they must take into account in order not to risk penalties. Consequently, if a sentence expresses both a right for one subject and a duty for another, we prioritize the obligation when classifying it.
   Another case where the priority factor comes into play is that of clauses containing both relevant and irrelevant obligations. In these cases, since we cannot break the clause down into several parts, we give priority to the relevant obligation. In terms of risk, it is better to classify an irrelevant obligation as relevant than the other way around.
   In addition, we have to consider that obligations may be stated implicitly. For example, if a person can perform an action only under certain conditions, it is implied that those conditions can be interpreted as obligations. According to this principle, we do not classify a sentence such as "Spectators may enter the theatre" as an obligation, but we do so when a condition is added, as in "Spectators may enter the theatre only if they have the ticket."
   Even if we, as readers, do not pay attention to it, normative texts often contain implicit information that readers are naturally able to recover while reading, such as an implied subject, or a reference to another part of the document or to an external document. Unlike a human reader, an automatic classifier that is not provided with enough context may encounter difficulties in handling this kind of case.
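The two priority rules above (obligation over right, relevant over not relevant) can be made concrete with a small sketch. This is purely illustrative: the label names and the resolve_clause_label helper are ours, not part of the annotation guidelines or of any tool described in this paper.

```python
from enum import Enum

class Label(Enum):
    NOT_OBLIGATION = "not obligation"
    RELEVANT = "relevant obligation"
    NOT_RELEVANT = "not relevant obligation"

def resolve_clause_label(found):
    """Resolve the set of labels detected in a clause into a single label.

    Implements the two priority rules described above:
    - an obligation (even one implied by a right) wins over no obligation;
    - a relevant obligation wins over a not relevant one, since marking an
      irrelevant obligation as relevant is the safer error for compliance.
    """
    if Label.RELEVANT in found:
        return Label.RELEVANT
    if Label.NOT_RELEVANT in found:
        return Label.NOT_RELEVANT
    return Label.NOT_OBLIGATION

# A clause stating both a duty of a public body (not relevant) and a duty
# of a bank (relevant) is resolved in favour of the relevant obligation:
print(resolve_clause_label({Label.NOT_RELEVANT, Label.RELEVANT}).value)
# relevant obligation
```

The point of encoding the rules this way is that resolution is deterministic: whenever a clause mixes categories, the outcome annotators should choose is fixed in advance rather than left to case-by-case judgment.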

3    Data Annotation

We extracted the dataset from Daitomic1, a product that automatically collects legal documents from a wide variety of legal sources, automatically represents them according to the Akoma Ntoso standard (Palmirani and Vitali, 2011) and makes them available through a dedicated user interface. The adoption of Akoma Ntoso lets us represent the structure of heterogeneous legal texts in a unified format, which enables us to apply the same operations to very different kinds of poorly encoded documents, such as PDF, HTML and DOCX files.

1    https://www.daitomic.com/

   The corpus was manually labelled by three trained annotators with no previous background in the legal domain and contains 71 regulations for a total of 10,628 clauses. We selected regulations that touch heterogeneous topics, such as data privacy, financial risk and tax compliance, all of which are known to be relevant for financial institutions. To deal with the heterogeneity of normative sources, we found it appropriate to take texts from different sources, so that we could train the model in a balanced way. In particular, we extracted the texts from thirty of the most important regulatory sources for Italian financial institutions, including Gazzetta Ufficiale Italiana, EUR-Lex, Consob and Banca d'Italia. From these sources, we selected texts of different types: acts, regulations, decisions, directives, communications, statutes, and more. In this way, we created a very heterogeneous dataset that can be considered representative of the wide variety of existing regulations.
   The annotations were carried out directly from the graphical user interface of the Daitomic application, which allows users, within the consultation section, to mark the requirements present in a law and to classify them as relevant or not relevant. The application texts are already structured in a tree divided into chapters, articles, paragraphs, clauses, etc., and we annotated the smallest parts, i.e. the clauses. Each clause is flanked by a sidebar; clicking on it opens the pop-up shown in Figure 1, which allows the annotators to choose the label they consider most appropriate. As a result of this choice, the sidebar turns light blue if the obligation is classified as relevant to financial institutions, and dark blue if it is not relevant.

Figure 1: Pop-up for setting the label of the obligation.

   We picked four of the annotated laws, containing as many as 2189 clauses, to be annotated by all three annotators.

4    Annotations Evaluation

We used the part of the dataset annotated by all three annotators to calculate the inter-annotator agreement (IAA). Using Krippendorff's Alpha reliability, we computed the IAA in two different ways: at first checking only whether the annotators had classified the sentences as obligations or non-obligations, then also taking into account their choices in distinguishing obligations between relevant and non-relevant. The resulting IAA is 0.58 when the distinction between relevant and not relevant is considered, but increases to 0.70 when no such distinction is applied.
   To better understand these results, we carried out a manual analysis, from which it turned out that most cases of disagreement are of two kinds (two examples are reported in Table 1). The lack of agreement between annotators can primarily be attributed to the fact that a clause often has no explicitly expressed subject, either because it is expressed in the preceding clauses or because it can be inferred from the context, as in the first example. Another frequent reason for disagreement is the fact that our annotators, not being experts in the legal field, are not always able to understand the kind of subject to which the obligation refers, as in the second example. In such cases, expert annotators might be more reliable.

5    Automatic Classifier

We also used the dataset we built to train an automatic classifier. We split the dataset into training (90%) and test (10%) sets. As a learning
 Annotator 1 Annotator 2 Annotator 3 text
 not relevant    relevant         relevant        I contratti di assicurazione di cui al comma 1, lettera b),
                                                  sono corredati da un regolamento, redatto in base alle
                                                  direttive impartite dalla COVIP [...]
                                                  en:[The insurance contracts referred to in paragraph
                                                  1, letter b), are accompanied by a regulation, drawn up
                                                  on the basis of the directives issued by COVIP [...]]
 relevant        relevant         not relevant    Il soggetto incaricato del collocamento nel territorio
                                                  dello Stato provvede altresì agli adempimenti stabil-
                                                  iti [...]
                                                  en:[The person in charge of placement in the territory
                                                  of the State also provides for the established obliga-
                                                  tions [...]]


Table 1: Example of disagreement among annotators. Correct classifications are shown in blue while
incorrect classifications are shown in red.
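Disagreement patterns like those in Table 1 are exactly what the Krippendorff's Alpha figures of Section 4 summarize. As an illustration, nominal Alpha can be computed from the labels each item received; the routine below is a self-contained sketch run on toy data, not on the paper's actual annotations.

```python
from collections import Counter, defaultdict

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's Alpha for nominal data.

    `ratings` maps each item to the list of labels it received
    (one per annotator; missing annotations are simply omitted).
    """
    coincidences = Counter()              # o_ck: label-pair coincidence counts
    for labels in ratings.values():
        m = len(labels)
        if m < 2:
            continue                      # items rated once contribute no pairs
        for i, c in enumerate(labels):
            for j, k in enumerate(labels):
                if i != j:
                    coincidences[(c, k)] += 1 / (m - 1)
    n_c = defaultdict(float)              # marginal total per label
    for (c, _), v in coincidences.items():
        n_c[c] += v
    n = sum(n_c.values())
    observed = sum(v for (c, k), v in coincidences.items() if c != k) / n
    expected = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 if expected == 0 else 1 - observed / expected

# Toy example: three annotators label four clauses as obligation/not.
toy = {
    1: ["obl", "obl", "obl"],
    2: ["not", "not", "not"],
    3: ["obl", "obl", "not"],
    4: ["not", "obl", "not"],
}
print(round(krippendorff_alpha_nominal(toy), 3))  # 0.389
```

Unlike raw percent agreement, Alpha discounts the agreement expected by chance from the label marginals, which is why partial disagreement on only two of four items already pulls the toy score well below 1.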

model, we used UmBERTo2, an Italian pretrained language model trained by Musixmatch and based on the RoBERTa architecture (Liu et al., 2019), which has recently been proven to provide state-of-the-art performance on other Italian tasks (Occhipinti et al., 2020; Sarti, 2020; Giorgioni et al., 2020). The model has 12 layers, a hidden size of 768, 12 attention heads and 110M parameters. On top of the language model, we added a ReLU classifier (Nair and Hinton, 2010). All the model's weights were updated during fine-tuning. We applied dropout (Srivastava et al., 2014) with probability 0.1 to both the attention and the hidden layers. We used cross-entropy as the loss function and trained the system until early stopping at epoch 6. The performance obtained on the test set is reported in Table 2. The system's performance is fairly good compared to the IAA, but not reliable enough for real-world scenarios. However, if we evaluate the system without considering the difference between not relevant and relevant obligations (Table 3), we observe much more accurate results, suggesting that the system, similarly to the annotators, performs well in identifying obligations but struggles in distinguishing between relevant and not relevant ones.

2    https://github.com/musixmatchresearch/umberto

                            Precision   Recall   F-Score
 Not Obligations                 0.96     0.98      0.97
 Relevant Obligations            0.67     0.63      0.65
 Not Relevant Obligations        0.84     0.76      0.80

Table 2: System performance evaluation on the test set

                            Precision   Recall   F-Score
 Not Obligations                 0.96     0.98      0.97
 Obligations                     0.95     0.87      0.91

Table 3: System performance evaluation on the test set with no distinction between relevant and not relevant obligations

6    Human vs Automatic Classification

To better understand the model's capabilities, we ran a manual error analysis, comparing human annotations against automatic classifications on the test set. We identified some categories of typical errors and report some examples in Table 4. In some cases, the errors of the model are attributable to a non-explicit subject, which the human annotator can derive from the context, as can be seen in the first example, where it is not explicitly specified who should enter the data in the communication. Looking at the second example, we see a sentence whose main message is the expression of a right, in this case the right to access a certain file. However, access to the file is allowed only under certain temporal conditions (at the conclusion of the appeal procedure), so a relevant obligation is hidden behind that right. Unfortunately, in these cases the model is often wrong.

 Human         Machine       text
 not relevant  relevant      Nella comunicazione di avvio di cui al comma 2 sono indicati l'oggetto del
                             procedimento, gli elementi acquisiti d'ufficio [...]
                             en:[In the communication of initiation referred to in paragraph 2 are indi-
                             cated the subject of the procedure, the elements acquired ex officio [...]]
 relevant      none          L'accesso al fascicolo è consentito a conclusione della procedura di inter-
                             pello ai fini della tutela in sede giurisdizionale.
                             en:[Access to the file is granted at the conclusion of the appeal procedure
                             for judicial protection purposes.]
 relevant      none          È considerata ingannevole la pubblicità che, in quanto suscettibile di rag-
                             giungere bambini ed adolescenti, può, anche indirettamente, minacciare la
                             loro sicurezza.
                             en:[Advertising that is likely to reach children and adolescents and that may
                             even indirectly threaten their safety is considered misleading.]
 relevant      not relevant  Le amministrazioni interessate provvedono agli adempimenti previsti dal
                             presente decreto con le risorse umane, finanziarie e strumentali disponibili
                             [...].
                             en:[The administrations involved shall carry out the obligations provided
                             for in this decree with the human, financial and instrumental resources
                             available. [...]]
 relevant      none          Il presente decreto reca le disposizioni di attuazione dell'articolo 1 del de-
                             creto legge 6 dicembre 2011, n. 201, convertito, con modificazioni, dalla
                             legge 22 dicembre 2011, n. 214 [...].
                             en:[This decree contains the provisions for the implementation of article 1
                             of Law Decree no. 201 of December 6, 2011, converted, with amendments,
                             by Law no. 214 of December 22, 2011 [...]]

Table 4: Examples of disagreement between manual (Human) and automatic (Machine) annotations. Correct classifications are shown in blue, incorrect classifications in red.

   Another case that is difficult to handle is the one shown in the third example in Table 4. This is a sentence that apparently contains simple information: advertising is considered deceptive if it can threaten the safety of children. But behind this message lies an obligation on advertisers to avoid such a situation. Again, the obligation is not explicit, so it is quite understandable that the model can be wrong. Finally, the last two examples show human errors; it is worth noting that where annotators make errors due to distraction or misunderstanding, the model often classifies correctly.

7    Conclusions

In this work we propose a methodology for building training corpora for obligation classification based on annotations performed by non-experts. We apply this methodology to a set of heterogeneous regulations from a collection of different legal sources. The IAA and a manual error analysis highlight that human annotation is in general prone to errors and that non-expert annotators struggle to distinguish between relevant and not relevant obligations. The dataset produced has been used to train and test an obligation classification system based on state-of-the-art pretrained language models. We conduct both an automatic evaluation and a manual error analysis, from which it turns out that the system, similarly to human annotators, performs well in recognizing obligations but struggles in distinguishing between relevant and not relevant ones. As future work, we plan to involve domain-expert annotators to evaluate whether their contribution can improve the quality of the data and of the model. We will also explore techniques to provide more context to the classifier in order to improve performance on clauses in which the subject is implied.

References

Roberto Bartolini, Alessandro Lenci, Simonetta Montemagni, Vito Pirrelli, and Claudia Soria. 2004. Automatic classification and analysis of provisions in italian legal texts: a case study. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", pages 593–604. Springer.

Carlo Biagioli, Enrico Francesconi, Andrea Passerini, Simonetta Montemagni, and Claudia Soria. 2005. Automatic semantics extraction in law documents. In Proceedings of the 10th international conference on Artificial intelligence and law, pages 133–140.

Andrea Cimino, Lorenzo De Mattei, and Felice Dell'Orletta. 2018. Multi-task learning in deep neural networks at evalita 2018. Proceedings of the Evaluation Campaign of Natural Language Processing and Speech tools for Italian, pages 86–95.

Lorenzo De Mattei, Andrea Cimino, and Felice Dell'Orletta. 2018. Multi-task learning in deep neural network for sentiment polarity and irony classification. In NL4AI@ AI*IA, pages 76–82.

Enrico Francesconi and Andrea Passerini. 2007. Automatic classification of provisions in legislative texts. Artificial Intelligence and Law, 15(1):1–17.

Simone Giorgioni, Marcello Politi, Samir Salman, Roberto Basili, and Danilo Croce. 2020. Unitor@ sardistance2020: Combining transformer-based architectures and transfer learning for robust stance detection. In EVALITA.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In ICML.

Daniela Occhipinti, Andrea Tesei, Maria Iacono, Carlo Aliprandi, Lorenzo De Mattei, and Aptus AI. 2020. Italianlp@ tag-it: Umberto for author profiling at tag-it 2020. In Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Monica Palmirani and Fabio Vitali, 2011. Akoma-Ntoso for Legal Documents, pages 75–100. Springer Netherlands, Dordrecht.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523.

Gabriele Sarti. 2020. Umberto-mtsa@ accompl-it: Improving complexity and acceptability prediction with multi-task learning on self-supervised annotations. arXiv preprint arXiv:2011.05197.

Amin Sleimi, Nicolas Sannier, Mehrdad Sabetzadeh, Lionel Briand, and John Dann. 2018. Automated extraction of semantic legal metadata using natural language processing. In 2018 IEEE 26th International Requirements Engineering Conference (RE), pages 124–135. IEEE.

Amin Sleimi, Marcello Ceci, Nicolas Sannier, Mehrdad Sabetzadeh, Lionel Briand, and John Dann. 2019. A query system for extracting requirements-related information from legal texts. In 2019 IEEE 27th International Requirements Engineering Conference (RE), pages 319–329. IEEE.

Amin Sleimi, Marcello Ceci, Mehrdad Sabetzadeh, Lionel C Briand, and John Dann. 2020. Automated recommendation of templates for legal requirements. In 2020 IEEE 28th International Requirements Engineering Conference (RE), pages 158–168. IEEE.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958.