=Paper=
{{Paper
|id=Vol-3033/paper64
|storemode=property
|title=An Obligations Extraction System for Heterogeneous Legal Documents: Building and Evaluating Data and Model
|pdfUrl=https://ceur-ws.org/Vol-3033/paper64.pdf
|volume=Vol-3033
|authors=Maria Iacono,Laura Rossi,Paolo Dangelo,Andrea Tesei,Lorenzo De Mattei
|dblpUrl=https://dblp.org/rec/conf/clic-it/IaconoRDTM21
}}
==An Obligations Extraction System for Heterogeneous Legal Documents: Building and Evaluating Data and Model==
Maria Iacono, Laura Rossi, Paolo Dangelo, Andrea Tesei, Lorenzo De Mattei
Aptus.AI, Pisa, Italy
{maria,laura,paolo,andrea,lorenzo}@aptus.ai

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

A system that automatically extracts obligations from heterogeneous regulations could be of great help for a variety of stakeholders, including financial institutions. To reach this goal, we propose a methodology to build a training set of regulations written in Italian coming from a set of different legal sources, together with a system based on a Transformer language model to solve this task. More importantly, we take a deep dive into the process of human and machine-learned annotation by carrying out both quantitative and manual evaluations of both.

1 Introduction

Compliance practitioners in financial institutions are overburdened by the high volume of upcoming regulations coming from different legal sources, such as the European Union, national legislation, central banks and independent administrative authorities, to name a few. Part of the work of compliance offices consists of extracting obligations from this vast amount of regulations to trigger compliance processes. Extracting obligations from such a large body of regulations is tedious and repetitive work, and in this scenario systems that automate the process could substantially cut down costs; Machine Learning (ML) and Natural Language Processing (NLP) may help. However, given the variety of legal sources, training this kind of system is a complex activity, because it requires a sufficient amount of annotated data, which is expensive to produce, especially if the annotations require legal domain experts.

The obligations extraction topic has already been studied with different approaches. Bartolini et al. (2004) used a shallow syntactic parser and hand-crafted rules to automatically classify law paragraphs according to their regulatory content and to extract relevant text fragments corresponding to specific semantic roles. Similarly, Sleimi et al. (2018) automatically represent the semantics of legal texts in an RDF schema with a system based on a dependency parser and hand-crafted rules; Sleimi et al. (2019) used the same representation to build a question-answering system with a focus on obligations. Biagioli et al. (2005) represent law paragraphs as bags of words with either TF or TF-IDF weighting (Salton and Buckley, 1988) and used Support Vector Machines (SVM) to classify each paragraph as a type of provision, including obligations. A similar approach is adopted by Francesconi and Passerini (2007): they classify paragraphs of legislative texts according to their proposed provision model, representing them in a similar way to Biagioli et al. (2005) and using two learning algorithms, Naive Bayes and SVM. Sleimi et al. (2020) propose to address the complexity of regulatory texts by writing them according to a set of standard templates that can be parsed easily.

Contributions. In this work we offer four main contributions. (i) We propose a methodology for building training corpora relying on non-expert annotators, and we apply this methodology to a set of heterogeneous regulations written in Italian, coming from a set of different legal sources. (ii) We assess the quality of the introduced methodology relying on an inter-annotator agreement score, and we carry out an error analysis to highlight if and when expert annotators are required. (iii) We
use the dataset produced to train and test an obligations classification system based on neural networks, as this approach has been proven to provide state-of-the-art results for several Italian classification tasks (De Mattei et al., 2018; Cimino et al., 2018; Occhipinti et al., 2020). (iv) We conduct a manual error analysis to investigate the strengths and the limitations of the mentioned system.

2 Task Description

The task we tackle consists of classifying regulation clauses as obligations or not. By obligation we mean, from a juridical point of view, a legal constraint imposed by law and addressed to a juridical person.

Being interested in developing a system that supports financial institutions, we distinguish two categories of obligations, classifying them as relevant or irrelevant for financial institutions. Each clause can then be classified into one of the following three categories: (i) not obligation, (ii) relevant obligation and (iii) not relevant obligation. This classification schema allows practitioners to retrieve in one click either all the obligations or only the relevant ones, so that they can decide whether to have a complete overview of the laws they are consulting or to focus only on the obligations that actually affect their institutions.

To distinguish the two categories, we look at the subject to whom the obligation is addressed: if it is a public institution, we classify it as an irrelevant obligation; in all other cases, as a relevant obligation. This simplification of the classification criterion may seem extreme, since it implies that any type of obligation not addressed to a public institution must be considered relevant for a financial institution. However, we believe that applying this distinction is a good strategy, because the documents we analyze are already filtered, i.e., they belong to a category of laws that impact financial institutions. Consequently, within them, if an obligation is not directed at a public institution, it will almost certainly be directed, somehow, at financial institutions.

2.1 Special Cases

Legal jargon is not merely a tool used for argumentation or narrative, but a constitutive element of the law. Consequently, the structure of legal texts has particular characteristics that must respond to precise and predictable patterns. Despite this, there are cases in which the language can be ambiguous. Since our goal is to build a dataset in line with the expectations of compliance practitioners, we analyzed some special cases with a group of experts in order to provide clear guidelines to annotators.

One such case is when an obligation is expressed indirectly, for example through the formulation of a right. If an article talks about rights of any kind, it assumes that those rights must be respected. So, for example, the right of a client to obtain a loan (the client's point of view) corresponds to a duty of the bank, which is obliged to grant it if the client meets the requirements (the bank's point of view). Similarly, an employee's right to go on vacation means that the employer must guarantee vacation days. For this reason, in deciding how to classify a part of a law, in addition to the interpretation by the annotator, the concept of "priority" comes into play. Since our application is designed to support financial institutions, our priority is to highlight the obligations that they must take into account in order not to risk penalties. Consequently, if a sentence represents both a right for one subject and a duty for another, we prioritize the obligation in classifying it.

Another case where the priority factor comes into play is that of clauses that contain both relevant and irrelevant obligations. In these cases, since we cannot break the clause down into several parts, we give priority to the relevant obligation. In terms of risk, it is better to classify an irrelevant obligation as relevant rather than the other way around.

In addition, we have to consider that obligations may be reported implicitly. For example, if a person can perform an action only under certain conditions, it is implied that those conditions can be interpreted as obligations. According to this principle, we do not classify a sentence such as "Spectators may enter the theatre" as an obligation. On the contrary, we do so when a condition is added, as in the sentence "Spectators may enter the theatre only if they have the ticket."

Even if we, as readers, do not pay attention to it, normative texts often contain implicit information that readers are naturally able to trace through reading, such as an implied subject, or a reference to another part of the document or to an external document. Unlike a reader, an automatic classifier, not being provided with enough context, may encounter difficulties in handling this kind of case.

3 Data Annotation

We extracted the dataset from Daitomic (https://www.daitomic.com/), a product that automatically collects legal documents from a wide variety of legal sources, automatically represents them according to the Akoma Ntoso standard (Palmirani and Vitali, 2011) and makes them available through a dedicated user interface. The adoption of Akoma Ntoso lets us represent the structure of heterogeneous legal texts in a unified format, which makes us able to apply the same operations to very different kinds of poorly encoded documents such as PDF, HTML and DOCX files.

The annotations were carried out directly from the graphical user interface of the Daitomic application, which allows users, within the consultation section, to mark the requirements present in the law and to classify them as relevant or not relevant. The application texts are already structured, so they present a tree structure divided into chapters, articles, paragraphs, clauses, etc., where we annotated the smallest parts, i.e. clauses. Each clause is flanked by a sidebar; clicking on it automatically opens the pop-up shown in Figure 1, which allows the annotators to choose the label that they consider most appropriate. As a result of this choice, the sidebar turns light blue if the obligation is classified as relevant to financial institutions, and dark blue if it is not relevant.

[Figure 1: Pop-up for setting the label of the obligation.]

The corpus has been manually labelled by three trained annotators with no previous background in the legal domain and contains 71 regulations for a total of 10,628 clauses. We selected regulations that touch heterogeneous topics such as data privacy, financial risk, tax compliance and many more, but all of them are known to be relevant for financial institutions. In order to deal with the heterogeneity of normative sources, we found it appropriate to take texts from different sources, so that we could train the model in a balanced way. In particular, we extracted the texts from thirty of the most important regulatory sources for Italian financial institutions, including Gazzetta Ufficiale Italiana, EUR-Lex, Consob, Banca d'Italia and many more. From these sources, we selected texts of different types: acts, regulations, decisions, directives, communications, statutes, and more. In this way, we created a very heterogeneous dataset that can be considered representative of the wide variety of existing regulations. We picked four of the annotated laws, containing as many as 2,189 clauses, to be annotated by all three annotators.

4 Annotations Evaluation

We used the part of the dataset annotated by all three annotators to calculate the inter-annotator agreement (IAA). Using Krippendorff's alpha reliability, we computed the IAA in two different ways: at first checking only whether the annotators had classified the sentences as obligations or non-obligations, then also taking into account their choices in distinguishing obligations between relevant and not relevant. The resulting IAA is 0.58 considering the distinction between relevant and not relevant, but increases to 0.70 if no such distinction is applied.

In order to better understand these results, we carried out a manual analysis, from which it turned out that most cases of disagreement are of two kinds (two examples are reported in Table 1). The lack of agreement between annotators can be primarily attributed to the fact that there is often no explicitly expressed subject in a clause, either because it is expressed in the preceding clauses or because it can be inferred from the context, as we can see in the first example. Another frequent reason for disagreement is the fact that our annotators, not being experts in the legal field, are not always able to understand the kind of subject to which the obligation refers, as in the second example. In such cases, expert annotators might be more reliable.

5 Automatic Classifier

We also used the dataset we built to train an automatic classifier. We split the dataset into training (90%) and test (10%) sets.

Table 1 (examples of annotator disagreement):

(1) Annotator 1: not relevant; Annotator 2: relevant; Annotator 3: relevant.
I contratti di assicurazione di cui al comma 1, lettera b), sono corredati da un regolamento, redatto in base alle direttive impartite dalla COVIP [...]
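Section 4 measures agreement on the subset annotated by all three annotators with Krippendorff's alpha. The paper does not publish its implementation, so the following is only a minimal self-contained sketch of the nominal-level computation (the function name and the label strings are illustrative, not taken from the authors' code):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` holds one list of labels per annotated clause, e.g. the
    three annotators' choices for that clause.
    """
    # Coincidence counts: every ordered pair of labels within a unit,
    # weighted by 1 / (number of ratings in the unit - 1).
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a singly-rated unit carries no agreement information
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall total.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = (n * n - sum(t * t for t in totals.values())) / (n - 1)
    return 1.0 - observed / expected

# Perfect agreement on two clauses yields alpha = 1.0.
print(krippendorff_alpha_nominal([["rel", "rel", "rel"],
                                  ["not", "not", "not"]]))
```

Running the same function once over the binary obligation/non-obligation labels and once over the three-way labels reproduces the paper's two evaluation settings.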
en:[The insurance contracts referred to in paragraph 1, letter b), are accompanied by a regulation, drawn up on the basis of the directives issued by COVIP [...]]

(2) Annotator 1: relevant; Annotator 2: relevant; Annotator 3: not relevant.
Il soggetto incaricato del collocamento nel territorio dello Stato provvede altresì agli adempimenti stabiliti [...]
en:[The person in charge of placement in the territory of the State also provides for the established obligations [...]]

Table 1: Examples of disagreement among annotators. Correct classifications are shown in blue, incorrect classifications in red.

As a learning model, we used UmBERTo (https://github.com/musixmatchresearch/umberto), an Italian pretrained language model trained by Musixmatch and based on the RoBERTa architecture (Liu et al., 2019), which has recently been shown to provide state-of-the-art performance on other Italian tasks (Occhipinti et al., 2020; Sarti, 2020; Giorgioni et al., 2020). This language model has 12 layers, a hidden size of 768, 12 attention heads, and 110M parameters. On top of the language model, we added a ReLU classifier (Nair and Hinton, 2010). All the model's weights were updated during fine-tuning. We applied dropout (Srivastava et al., 2014) with probability 0.1 to both the attention and the hidden layers. We used cross-entropy as the loss function and trained the system with early stopping, which triggered at epoch 6. The performance obtained on the test set is reported in Table 2. The system's performance is fairly good compared to the IAA, but not reliable enough to be used in real-world scenarios. However, if we evaluate the system without considering the difference between not relevant and relevant obligations (Table 3), we observe much more accurate results, suggesting that the system, similarly to the annotators, performs well in identifying obligations but struggles in distinguishing between relevant and not relevant ones.

Table 2: System performance evaluation on the test set.
| | Precision | Recall | F-Score |
| Not Obligations | 0.96 | 0.98 | 0.97 |
| Relevant Obligations | 0.67 | 0.63 | 0.65 |
| Not Relevant Obligations | 0.84 | 0.76 | 0.80 |

Table 3: System performance evaluation on the test set with no distinction between relevant and not relevant obligations.
| | Precision | Recall | F-Score |
| Not Obligations | 0.96 | 0.98 | 0.97 |
| Obligations | 0.95 | 0.87 | 0.91 |

6 Human vs Automatic Classification

In order to better understand the model's capabilities, we ran a manual error analysis, comparing human annotations against automatic classifications on the test set. We identified some categories of typical errors and report some examples in Table 4. In some cases, the errors of the model are attributable to a non-explicit subject, which the human annotator can derive from the context, as can be seen in the first example, where it is not explicitly specified who should enter the data in the communication. Looking at the second example, we can see a sentence whose main message is the expression of a right, in this case the right to access a certain file. However, access to the file is allowed only under certain temporal conditions (at the conclusion of the appeal procedure), so behind that right a relevant obligation is hidden. Unfortunately, in these cases the model is often wrong.

Another case that is difficult to handle is the one shown in the third example in Table 4. This is a sentence that apparently contains simple information: advertising is considered deceptive if it can threaten the safety of children. But behind this message lies an obligation on advertisers to avoid such a situation. Again, the obligation is not explicit, so it is quite understandable that the model can be wrong. Finally, the last two examples show human errors, and it was noted with some interest that where annotators make errors due to distraction or misunderstanding, the model often classifies correctly.

Table 4 (examples of disagreement between manual and automatic annotations):

(1) Human: not relevant; Machine: relevant.
Nella comunicazione di avvio di cui al comma 2 sono indicati l'oggetto del procedimento, gli elementi acquisiti d'ufficio [...]
en:[In the communication of initiation referred to in paragraph 2 are indicated the subject of the procedure, the elements acquired ex officio [...]]

(2) Human: relevant; Machine: none.
L'accesso al fascicolo è consentito a conclusione della procedura di interpello ai fini della tutela in sede giurisdizionale.
en:[Access to the file is granted at the conclusion of the appeal procedure for judicial protection purposes.]

(3) Human: relevant; Machine: none.
È considerata ingannevole la pubblicità che, in quanto suscettibile di raggiungere bambini ed adolescenti, può, anche indirettamente, minacciare la loro sicurezza.
en:[Advertising that is likely to reach children and adolescents and that may, even indirectly, threaten their safety is considered misleading.]

(4) Human: relevant; Machine: not relevant.
Le amministrazioni interessate provvedono agli adempimenti previsti dal presente decreto con le risorse umane, finanziarie e strumentali disponibili [...]
en:[The administrations involved shall carry out the obligations provided for in this decree with the human, financial and instrumental resources available [...]]

(5) Human: relevant; Machine: none.
Il presente decreto reca le disposizioni di attuazione dell'articolo 1 del decreto legge 6 dicembre 2011, n. 201, convertito, con modificazioni, dalla legge 22 dicembre 2011, n. 214 [...]
en:[This decree contains the provisions for the implementation of Article 1 of Law Decree no. 201 of December 6, 2011, converted, with amendments, by Law no. 214 of December 22, 2011 [...]]

Table 4: Examples of disagreement between manual (Human) and automatic (Machine) annotations. Correct classifications are shown in blue, incorrect classifications in red.

7 Conclusions

In this work we propose a methodology for building training corpora for obligations classification, based on annotations performed by non-experts. We apply this methodology to a set of heterogeneous regulations from a collection of different legal sources. The IAA and a manual error analysis highlight that human annotation is in general prone to errors and that non-expert annotators struggle to distinguish between relevant and not relevant obligations. The dataset produced has been used to train and test an obligations classification system based on state-of-the-art pretrained language models. We conduct both an automatic evaluation and a manual error analysis, from which it turns out that the system, similarly to human annotators, performs well in recognizing obligations but struggles in distinguishing between relevant and not relevant ones. As future work, we plan to involve domain-expert annotators to evaluate whether their contribution can improve the quality of the data and of the model. We will also explore techniques to provide more context to the classifier, in order to improve performance on clauses in which the subject is implied.

References

Roberto Bartolini, Alessandro Lenci, Simonetta Montemagni, Vito Pirrelli, and Claudia Soria. 2004. Automatic classification and analysis of provisions in Italian legal texts: a case study. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", pages 593–604. Springer.

Gabriele Sarti. 2020. UmBERTo-MTSA @ AcCompl-it: Improving complexity and acceptability prediction with multi-task learning on self-supervised annotations. arXiv preprint arXiv:2011.05197.

Amin Sleimi, Nicolas Sannier, Mehrdad Sabetzadeh, Lionel Briand, and John Dann. 2018. Automated extraction of semantic legal metadata using natural language processing. In 2018 IEEE 26th International Requirements Engineering Conference (RE), pages 124–135. IEEE.
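The binary evaluation (Table 3) is obtained by collapsing the three-way labels into an obligation / not-obligation view. A minimal sketch of this relabeling and of a per-class precision/recall/F-score computation follows; the label strings and function names are illustrative assumptions, not the paper's actual code:

```python
def collapse(label):
    """Map the three-way schema onto the binary one used in Table 3."""
    return label if label == "not obligation" else "obligation"

def precision_recall_f(gold, pred, positive):
    """Precision, recall and F-score for one class, as in Tables 2 and 3."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

gold = ["relevant", "not relevant", "not obligation", "relevant"]
pred = ["relevant", "relevant", "not obligation", "not relevant"]

# Three-way: the two obligation classes are confused with each other...
print(precision_recall_f(gold, pred, "relevant"))  # (0.5, 0.5, 0.5)

# ...but after collapsing, every obligation is still identified.
binary_gold = [collapse(l) for l in gold]
binary_pred = [collapse(l) for l in pred]
print(precision_recall_f(binary_gold, binary_pred, "obligation"))  # (1.0, 1.0, 1.0)
```

The toy example mirrors the paper's observation: scores jump once relevant and not relevant obligations are merged, because most errors are confusions between those two classes rather than missed obligations.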
Carlo Biagioli, Enrico Francesconi, Andrea Passerini, Simonetta Montemagni, and Claudia Soria. 2005. Automatic semantics extraction in law documents. In Proceedings of the 10th International Conference on Artificial Intelligence and Law, pages 133–140.

Andrea Cimino, Lorenzo De Mattei, and Felice Dell'Orletta. 2018. Multi-task learning in deep neural networks at EVALITA 2018. In Proceedings of the Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, pages 86–95.

Lorenzo De Mattei, Andrea Cimino, and Felice Dell'Orletta. 2018. Multi-task learning in deep neural network for sentiment polarity and irony classification. In NL4AI @ AI*IA, pages 76–82.

Enrico Francesconi and Andrea Passerini. 2007. Automatic classification of provisions in legislative texts. Artificial Intelligence and Law, 15(1):1–17.

Simone Giorgioni, Marcello Politi, Samir Salman, Roberto Basili, and Danilo Croce. 2020. UNITOR @ SardiStance2020: Combining transformer-based architectures and transfer learning for robust stance detection. In EVALITA.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML.

Daniela Occhipinti, Andrea Tesei, Maria Iacono, Carlo Aliprandi, and Lorenzo De Mattei. 2020. ItalianLP @ TAG-it: UmBERTo for author profiling at TAG-it 2020. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Monica Palmirani and Fabio Vitali. 2011. Akoma Ntoso for Legal Documents, pages 75–100. Springer Netherlands, Dordrecht.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523.

Amin Sleimi, Marcello Ceci, Nicolas Sannier, Mehrdad Sabetzadeh, Lionel Briand, and John Dann. 2019. A query system for extracting requirements-related information from legal texts. In 2019 IEEE 27th International Requirements Engineering Conference (RE), pages 319–329. IEEE.

Amin Sleimi, Marcello Ceci, Mehrdad Sabetzadeh, Lionel C. Briand, and John Dann. 2020. Automated recommendation of templates for legal requirements. In 2020 IEEE 28th International Requirements Engineering Conference (RE), pages 158–168. IEEE.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.