The CRIKE Data-Science Process for Legal Knowledge Extraction
Discussion Paper

Silvana Castano, Mattia Falduti, Alfio Ferrara, and Stefano Montanelli
Università degli Studi di Milano
DI - Via Celoria, 18 - 20135 Milano
{silvana.castano,mattia.falduti,alfio.ferrara,stefano.montanelli}@unimi.it

Abstract. In this paper, we present CRIKE, a data-science approach to automatically detect concrete applications of legal abstract terms in case-law decisions. To this purpose, CRIKE relies on the LATO ontology, where legal abstract terms are formalized as concepts and relations among concepts. Using LATO, CRIKE aims at discovering how and where legal abstract terms are applied by judges in their legal argumentation. Moreover, we detect the terminology used in the text of case-law decisions to characterize concrete abstract-term instances.

Keywords: legal ontology, legal-term extraction, case-law analysis

1 Introduction

Law is general and abstract by definition. By contrast, court case-law decisions are specific and concrete, in that they provide a particular interpretation of the law as applied to a single case. Legal interpreters, such as judges and lawyers, analyze and evaluate court case law daily in order to derive suggestions for incoming cases, relying on the experience of past applications, which can be regarded as a kind of consolidated legal knowledge.
According to Italian law, legal terminology can be divided into three main categories: i) statutory terms, i.e., terms directly or indirectly defined by law; examples of statutory terms are public officer, illicit drug, and consumer; ii) descriptive terms, i.e., terms denoting actions, human activities, and any real-life object; examples of descriptive terms are escape, car, and year; iii) abstract terms, i.e., terms denoting something indeterminate that requires a concrete application to be actually defined; examples of abstract terms are good faith, long-term cohabitation, and dangerous driving.

Consider the abstract schema of a legal action provided in Figure 1. When a new case is received for judgement, the expected evaluation process has to take into account i) the law, for understanding the terms, whether statutory, descriptive, or abstract, that can be relevant for the current case, and ii) the history of case-law decisions, for detecting possible relevant interpretations and concrete applications of abstract terms that can be useful to support the decision/verdict to be eventually delivered.

[Fig. 1. The abstract schema of a legal action. Elements: new case law to judge; courthouse judge; Law; history of Case-Law Decisions (CLDs); decision/verdict on new case law (concrete law interpretation).]

In this paper, we present CRIKE (CRIme Knowledge Extraction), a data-science approach to detect concrete applications of legal abstract terms in large datasets of case-law decisions. To this purpose, CRIKE relies on LATO (Legal Abstract Term Ontology), where legal terms are formalized as concepts and relations among concepts.

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.
Using LATO, CRIKE aims at discovering how and where legal abstract terms are applied by judges in their legal argumentation.

The paper is organized as follows. Section 2 introduces the CRIKE approach. The LATO ontology and the CRIKE techniques for legal knowledge extraction are discussed in Sections 3 and 4, respectively. Related work is discussed in Section 5. Concluding remarks are provided in Section 6.

2 The CRIKE approach

The CRIKE approach (see Figure 2) is conceived to support the extraction of legal knowledge from a (possibly large) dataset of Case-Law Decisions (CLDs) coming from different official sources, such as First Grade and Court of Appeal judgements. CRIKE embeds the LATO ontology, where relevant law concepts of a given domain of interest are formalized. To perform knowledge extraction, CRIKE takes a dataset of CLDs as input and applies a conventional data-science process in which each CLD is indexed and stored in a digital format. In particular, the CLDs of our dataset are acquired from the Court and the Court of Appeal of Milan, and they are usually provided in image format with highly heterogeneous quality. The indexing and storage activity exploits data cleaning and tokenization techniques to obtain a purely textual version of each CLD as well as a focused set of metadata. By exploiting the metadata of the indexed CLDs, knowledge extraction is performed with the aim of classifying a CLD with respect to the LATO ontology knowledge. In particular, extraction is focused on detecting the concrete applications of legal abstract terms in the text of the considered CLDs.

[Fig. 2. The CRIKE approach. Elements: Law; expert-based ontology design; LATO ontology (abstract term specification); dataset of Case-Law Decisions (CLDs); indexing/storage of CLDs; legal knowledge extraction; concrete applications of abstract terms.]
The crucial idea of CRIKE is that the detection of a given abstract term AT is concerned not only with the recognition of single terms denoting AT, but also with the recognition of terms associated with the ancillary concepts related to AT, which we call the abstract-term context.

Motivating example. Consider the Italian law about drugs and related drug offenses, as reported in [11]. According to the Italian criminal order, “the Consolidated Law, adopted by Presidential Decree No 309 on 9 October 1990 and subsequently amended, provides the legal framework for trade, treatment and prevention, and prohibition and punishment of illegal activities in the field of drugs and psychoactive substances. Drug use in itself is not mentioned as an offense. [...] The threshold between personal possession and trafficking is determined by the circumstances of the specific case (e.g., the act, possession of tools for packaging, different types of drug possessed, number of doses in excess of average daily use, means of organization). The penalty for supply-related offenses, such as production, sale, transport, distribution or acquisition, depends on the type of drug. However, when the offenses are considered minor because of the means, modalities or circumstances, the terms of imprisonment are lower. Evaluating whether or not the offense is minor should take into account a set of “ancillary” elements such as the mode of action, possible criminal motives, quality and quantity of drug possessed, the character of the offender, conduct during or subsequent to the offense, and the family and social conditions of the offender”.

The notion of minor offense is an example of an abstract term in the above quotation. A precise definition of the circumstances and related threshold quantities to associate with the notion of minor offense is neither available nor possible in the (abstract) law.
Given a specific criminal charge of drug possession, the final decision/verdict is based on the specific interpretation of the abstract term “minor offense”, where the specific circumstances and quantities of the considered case represent a concrete application of the corresponding abstract term.

3 Legal knowledge representation

To formalize the knowledge related to abstract terms and their interpretation, we introduce LATO in CRIKE. LATO is a legal ontology that defines the law terms relevant for knowledge extraction from CLDs; it contains concepts to represent general law terms, whether abstract, statutory, or descriptive. LATO is manually defined by domain experts and implemented according to the SKOS formalism. In particular, the concept hierarchy is based on a root concept Term with three main subconcepts, namely AbstractTerm, DescriptiveTerm, and StatutoryTerm (see Figure 3(a)). In addition to general law terms, the LATO ontology contains concepts that represent the Italian legislative structure, such as the concepts Law, LawArticle, and LawParagraph. Furthermore, the concepts Conviction and Discharge are specified in LATO to represent the possible Court decisions (i.e., the verdict) on a given case. In particular, the concept Conviction denotes a verdict in which the Court judges the defendant guilty, while the concept Discharge denotes a verdict in which the facts are of penal relevance, but no punishment is finally delivered. Finally, the concepts Quantity and UnitOfMeasure are defined in LATO to represent the quantitative estimation of substances that can appear in legal documents.

AbstractTerm is the core concept of the LATO ontology, since it represents the target of the knowledge extraction functionalities of CRIKE.
The related construct of SKOS is exploited to enrich the specification of an abstract term AT by formalizing the ontology relationships between AT and the other concepts of the LATO ontology composing its context. In particular, given an abstract term AT, related is used to connect AT to ancillary concepts of LATO representing i) an objective judgment element OBJ, usually expressed through the connection of AT with a descriptive/statutory concept; ii) a subjective quantitative evaluation SUBJ, usually expressed through a relationship between AT and the Quantity/UnitOfMeasure concepts; and iii) a legislative reference LREF, usually denoted by a connection of AT with a specific law or regulation (i.e., the Law, LawArticle, and LawParagraph concepts). According to SKOS, each LATO concept is associated with a preferred label (prefLabel) as well as with one or more alternative labels (altLabel) and hidden labels (hiddenLabel). This label set of literal descriptions enriches the concept definition and is very useful in the subsequent knowledge extraction to capture possible synonyms, acronyms, and abbreviations in the text of CLDs.

Example. An example of SKOS definition for the abstract term AT = MinorOffense is shown in Figure 3(b) according to the Italian drug-trafficking law. MinorOffense is related to the concepts Drug and DrugTraffickingVerb, which represent the OBJ relationships since they are subconcepts of StatutoryTerm and DescriptiveTerm, respectively. The relationships with the concepts Quantity and UnitOfMeasure represent the subjective judge evaluations SUBJ.
The concepts Par5, Art73, and DPR309/1990 are subconcepts of LawParagraph, LawArticle, and Law, respectively, and they express the legal references LREF of MinorOffense in the Italian criminal code where the drug trafficking crime is defined.

[Fig. 3. (a) Excerpt of the LATO concept hierarchy; (b) Example of concept definition for the abstract term MinorOffense, with its OBJ, SUBJ, and LREF relationships.]

4 Knowledge extraction in CRIKE

Knowledge extraction in CRIKE is based on the idea of exploiting text analysis techniques to detect the concrete applications of legal abstract terms belonging to LATO throughout the stored/indexed case-law decisions CLDs. To this end, for a given abstract term AT, we introduce the notion of abstract-term context $Ctx_{AT}$ containing, besides the AT term, all the concepts of LATO that are ancillary to AT, namely OBJ, SUBJ, or LREF concepts:

$Ctx_{AT} = \{ C_i \mid r(AT, C_i) \}$

where $r(AT, C_i)$ denotes a SKOS related relationship between the abstract term AT and the concept $C_i$. For each concept $C \in Ctx_{AT}$, we define the concept label set $L_C$ that contains the whole set of labels, whether preferred, alternative, or hidden, associated with C. Furthermore, based on the notion of $L_C$, we define the extended label set $\overline{L}_C$, where the concept label set of C is enriched by including the concept label sets of the concepts $C_j$ subsumed by C:

$\overline{L}_C = L_C \cup \{ L_{C_j} \mid C_j \subseteq C \}$

Consider the goal of detecting the concrete applications of a certain abstract term AT in a dataset of case-law decisions CLDs. CRIKE knowledge extraction is performed by exploiting the extended label sets $\overline{L}_C$ of the concepts in the context $Ctx_{AT}$.
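As a rough illustration of the two definitions above, the following Python sketch builds the context of an abstract term from its related links and computes an extended label set by merging the labels of subsumed concepts. The dictionaries `LABELS`, `BROADER`, and `RELATED`, the function names, and all label strings are hypothetical and are not taken from the actual LATO ontology. The sketch also assumes that subsumption is transitive and that the extended label set is the flat union of the label sets involved.

```python
# Toy fragment of a LATO-like ontology (illustrative labels only, not the real LATO).
LABELS = {
    "Drug": {"drug"},
    "Cocaine": {"cocaine", "crack"},
    "Heroin": {"heroin"},
    "Cannabis": {"cannabis"},
    "DrugTraffickingVerb": {"deal", "to deal"},
    "DPR309/1990": {"DPR 309/90", "DPR 309/1990"},
    "Art73": {"art. 73", "article 73"},
    "Par5": {"par. 5"},
    "Quantity": {"quantity", "amount", "weight"},
    "UnitOfMeasure": {"kg.", "kilogram"},
}

# Subsumption links: child concept -> parent concept.
BROADER = {"Cocaine": "Drug", "Heroin": "Drug", "Cannabis": "Drug"}

# SKOS-style `related` links from an abstract term to its ancillary concepts.
RELATED = {
    "MinorOffense": [
        "Drug", "DrugTraffickingVerb", "DPR309/1990",
        "Art73", "Par5", "Quantity", "UnitOfMeasure",
    ],
}


def context(abstract_term):
    """Ctx_AT: the concepts C_i such that r(AT, C_i) holds."""
    return list(RELATED.get(abstract_term, []))


def subsumed(concept):
    """All concepts C_j subsumed by `concept` (assumed transitive)."""
    result = []
    for child, parent in BROADER.items():
        if parent == concept:
            result.append(child)
            result.extend(subsumed(child))
    return result


def extended_label_set(concept):
    """Labels of `concept` merged with the labels of every subsumed concept."""
    labels = set(LABELS.get(concept, set()))
    for sub in subsumed(concept):
        labels |= LABELS.get(sub, set())
    return labels


if __name__ == "__main__":
    for c in context("MinorOffense"):
        print(c, sorted(extended_label_set(c)))
```

For a concept without subconcepts, such as `Art73` in this toy fragment, the extended label set coincides with the plain concept label set.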
For each document $d \in CLDs$, we define a vector representation $\mathbf{d}$ where each element corresponds to a concept in the context $Ctx_{AT}$. The value $\mathbf{d}[i]$ is set to 1 when a label hit is detected, meaning that at least one occurrence of a label in the extended label set $\overline{L}_{C_i}$ is found in d for the concept $C_i \in Ctx_{AT}$, and to 0 otherwise (i.e., label miss). A threshold-based mechanism is defined to specify the minimum number of label hits required to consider that a concrete application of the abstract term AT is detected in the document d.

Example. Consider the abstract term AT = MinorOffense and the corresponding context $Ctx_{MinorOffense}$ = {Drug, DrugTraffickingVerb, DPR309/1990, Art73, Par5, Quantity, UnitOfMeasure}. Moreover, consider the extended label set $\overline{L}_{Drug} = L_{Drug} \cup \{ L_{Cocaine}, L_{Heroin}, L_{Cannabis} \}$. In Figure 4, we show an example of knowledge extraction based on the concepts and corresponding extended label sets in the context $Ctx_{MinorOffense}$.

[Fig. 4. Example of knowledge extraction for the abstract term MinorOffense. The figure links labels of the context concepts (e.g., "DPR 309/90", "art. 73", "par. 5", "kg.", "deal", "cocaine") to their occurrences in the following CLD excerpt: "The police officers, during a routine inspection, discovered four bags of white powder onto the toilet floor (weighing 3.6 kg. in total). The powder tested positive for cocaine. [...] the deal. Both males were arrested. For instance, considered matters, mode of action, quality and quantity of the substance, the penal relevance is clear, but the offence is minor. For the foregoing reasons, the court applies the art. 73, DPR 309/90, par. 5 [...]."]

An example of vector-based document representation for the abstract term AT = MinorOffense is shown in Figure 5.
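The label-hit vector and the threshold check described above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the label sets in `CONTEXT_LABELS` are illustrative, labels are matched as case-insensitive whole tokens with a regular expression, and the threshold is interpreted as a minimum ratio of hits over the context size (0.8 by default, compared with `>=`). None of these choices is claimed to be the matching strategy actually used in CRIKE, and the names `label_hit`, `hit_vector`, and `is_detected` are hypothetical.

```python
import re

# Illustrative extended label sets for the context of MinorOffense (assumed values).
CONTEXT_LABELS = {
    "Drug": {"drug", "cocaine", "crack", "heroin", "cannabis"},
    "DrugTraffickingVerb": {"deal", "to deal"},
    "DPR309/1990": {"DPR 309/90", "DPR 309/1990", "L. 309/1990"},
    "Art73": {"art. 73", "article 73"},
    "Par5": {"par. 5", "paragr. V"},
    "Quantity": {"quantity", "amount", "weight"},
    "UnitOfMeasure": {"kg.", "kilogram", "kilo"},
}


def label_hit(label, text):
    """True if `label` occurs in `text` as a whole token (case-insensitive)."""
    pattern = r"(?<!\w)" + re.escape(label) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def hit_vector(text, context_labels=CONTEXT_LABELS):
    """One 0/1 entry per context concept: 1 on a label hit, 0 on a label miss."""
    return [
        1 if any(label_hit(label, text) for label in labels) else 0
        for labels in context_labels.values()
    ]


def is_detected(vector, threshold=0.8):
    """A concrete application is detected when the hit ratio reaches the threshold."""
    return sum(vector) / len(vector) >= threshold


if __name__ == "__main__":
    excerpt = (
        "The police officers, during a routine inspection, discovered four bags "
        "of white powder onto the toilet floor (weighing 3.6 kg. in total). "
        "The powder tested positive for cocaine. [...] the deal. Both males were "
        "arrested. For instance, considered matters, mode of action, quality and "
        "quantity of the substance, the penal relevance is clear, but the offence "
        "is minor. For the foregoing reasons, the court applies the art. 73, "
        "DPR 309/90, par. 5 [...]."
    )
    vector = hit_vector(excerpt)
    print(dict(zip(CONTEXT_LABELS, vector)), is_detected(vector))
```

Whole-token matching is used here so that a short label such as "deal" does not fire inside a longer word; a production system would more plausibly match against the tokenized text produced during indexing.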
If we consider a threshold of 80% of label hits, a concrete application of MinorOffense is detected in document d1, since 6 hits are found over the 7 concepts available in $Ctx_{MinorOffense}$ (6/7, about 86%). Document d2, with 4 hits out of 7 (about 57%), remains below the threshold.

      Drug  DrugTraffickingVerb  DPR309/90  Art73  Par5  Quantity  UnitOfMeasure
d1    1     0                    1          1      1     1         1
d2    1     1                    1          0      0     1         0

Fig. 5. Example of label hits for the abstract term MinorOffense

5 Related work

Work related to the issues addressed in CRIKE regards legal argumentation mining and legal ontology design.

Legal argumentation mining refers to the capability to automatically detect and classify the role of possible argumentative units within a considered legal case text [1]. In [10], the authors propose to mine statutory texts by using natural language processing and supervised machine learning techniques. More recently, the LUIMA approach has been proposed to focus on the extraction of evidential reasoning from a court decision dataset [5]. Moreover, a particularly relevant contribution is provided in [9] on the extraction of case-law sentences for argumentation about statutory terms.

A survey on legal ontology design is presented in [1], where a special focus is given to the representation of legal concepts in type systems. In [4], the notion of mutual consensus is introduced to support the specification of concepts and relations about contract formation. An application example based on a corpus of Italian legal texts is presented in [7], where the results of exploiting a learning system are provided. A further specification of a legal ontology using ONTOLINGUA is presented in [13]. Furthermore, in [12], the authors present the LOIS project (Lexical Ontologies for Legal Information Sharing) and discuss a methodology for building a multilingual semantic lexicon for law that can be used both as a source of semantic metadata and as an external tool for cross-lingual retrieval. On that topic, in [8], a methodology to automatically create an OWL ontology from a set of legal documents is presented.
In [3], an automated approach based on statistical analysis is described for the identification of core concepts and relations in a corpus of legal texts. Natural Language Processing (NLP) techniques are proposed in [6] to extract concepts and relations among legal concepts, with the aim of building an ontology for legal information retrieval.

The original contribution of the proposed CRIKE approach lies in combining a data-science process with an expert-based law ontology to extract knowledge from CLDs. A further distinctive feature of CRIKE is the formalization of an abstract term as a legal ontology concept with a corresponding context of related concepts. Ontology concepts with associated contexts are used to drive the identification of concrete applications/interpretations of the corresponding abstract terms in the text of CLDs.

6 CRIKE support to practices and concluding remarks

In this paper, we presented the CRIKE approach for legal knowledge extraction. We envisage the following main practices that can be supported by using CRIKE: i) knowledge-assisted verdict writing, where the concrete terminology extracted for abstract terms can support the judge in the preparation of new case-law decisions; ii) history-based verdict prediction, where the knowledge extracted by CRIKE is used to train a machine learning mechanism with the aim of predicting the possible decision on a new incoming case to judge; and iii) legal analytics, where the results of knowledge extraction are exploited to detect possible trends and common abstract-term interpretations.

A preliminary experimentation of CRIKE has been performed on a dataset provided by the Courthouse of Milan, Italy; the results are described in [2]. The goal of the experimentation was to analyze the effectiveness of CRIKE in recognizing the concrete applications of the abstract term MinorOffense.

Different research directions are currently being investigated.
On the one hand, we are working on a bootstrapping approach to enrich the LATO ontology, so that the context of abstract terms can be progressively augmented with new relevant terms and literals as they are detected in CLDs during extraction. On the other hand, machine learning techniques are being developed to perform a supervised classification of CLDs based on abstract terms, by exploiting a training set of CLDs manually annotated by domain experts.

References

1. Ashley, K.D.: Artificial Intelligence and Legal Analytics: New Tools for Law Practice in the Digital Age. Cambridge University Press (2017)
2. Castano, S., Falduti, M., Ferrara, A., Montanelli, S.: Crime Knowledge Extraction: An Ontology-Driven Approach for Detecting Abstract Terms in Case Law Decisions. In: Proc. of the 17th Int. Conf. on Artificial Intelligence and Law (2019)
3. Francesconi, E., Montemagni, S., Peters, W., Tiscornia, D.: Integrating a Bottom-Up and Top-Down Methodology for Building Semantic Resources for the Multilingual Legal Domain, vol. 6036, pp. 95–121. Springer (2010)
4. Gardner, A.: An Artificial Intelligence Approach to Legal Reasoning. MIT Press, Cambridge, MA, USA (1987)
5. Grabmair, M., Ashley, K.D., Chen, R., Sureshkumar, P., Wang, C., Nyberg, E., Walker, V.R.: Introducing LUIMA: an Experiment in Legal Conceptual Retrieval of Vaccine Injury Decisions Using a UIMA Type System and Tools. In: Proc. of the 15th Int. Conference on Artificial Intelligence and Law. pp. 69–78. ACM (2015)
6. Lame, G.: Using NLP Techniques to Identify Legal Ontology Components: Concepts and Relations, pp. 169–184. Springer Berlin Heidelberg (2005)
7. Lenci, A., Montemagni, S., Pirrelli, V., Venturi, G.: NLP-based Ontology Learning from Legal Texts. A Case Study. In: Proc. of the 2nd Workshop on Legal Ontologies and Artificial Intelligence Techniques. pp. 113–129. Citeseer (2007)
8. Saias, J., Quaresma, P.: A Methodology to Create Legal Ontologies in a Logic Programming Information Retrieval System, pp. 185–200. Springer (2005)
9. Savelka, J., Ashley, K.D.: Extracting Case Law Sentences for Argumentation about the Meaning of Statutory Terms. In: Proc. of the 3rd Int. Workshop on Argument Mining. pp. 50–59 (2016)
10. Savelka, J., Grabmair, M., Ashley, K.D.: Mining Information from Statutory Texts in Multi-Jurisdictional Settings. In: Proc. of the Int. Conference on Legal Knowledge and Information Systems. pp. 133–142. IOS Press (2014)
11. European Monitoring Centre for Drugs and Drug Addiction: Italy, Country Drug Report 2018. Tech. rep., European Monitoring Centre for Drugs and Drug Addiction (2018)
12. Tiscornia, D.: The LOIS project: Lexical Ontologies for Legal Information Sharing. In: Proc. of the V Legislative XML Workshop. pp. 189–204 (2006)
13. Visser, P., Bench-Capon, T.: The Formal Specification of a Legal Ontology. In: Proc. of the Int. Conference on Legal Knowledge and Information Systems (1996)