Finding Contextually Consistent Information Units in Legal Text

Dominic Seyler (University of Illinois at Urbana-Champaign, dseyler2@illinois.edu)∗
Paul Bruin (Regology, paul.bruin@regology.com)
Pavan Bayyapu (Regology, pavan.bayyapu@regology.com)
ChengXiang Zhai (University of Illinois at Urbana-Champaign, czhai@illinois.edu)

∗This work was done during an internship at Regology.

ABSTRACT
Terms in the laws of a legislature can be highly contextual, especially in corpora of codified laws and regulations, where the reader has to be aware of the correct context but the corpus lacks a single level of hierarchy. The goal of this work is to assist professionals when reading legal text within a codified corpus by finding contextually consistent information units. To achieve this, we combine NLP and data mining techniques to develop novel methodology that finds these information units in an unsupervised manner. Our method draws on expert experience and is modeled to emulate the "contextualization process" of experienced readers of legal content. We experimentally evaluate our method by comparing it to multiple expert-annotated datasets and find that it achieves near perfect performance on four state corpora and high precision on one federal corpus.

KEYWORDS
information units, logical document organization, legal text mining

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). NLLP @ KDD 2020, August 24th, San Diego, US. © 2020 Copyright held by the owner/author(s).

[Figure 1 graphic: a hierarchy-reference graph over title T1, chapters C1 and C2, subchapters SC1–SC4, and section ranges S1–S15, S20–S25, S30–S35, S40–S49, with weighted reference arrows.]
Figure 1: An example of a hierarchy-reference graph. Identified root contexts are highlighted with ★.

1 INTRODUCTION
Within a corpus, the same term can have different meanings. For instance, in the United States Code (USC), 26 USC § 7701(a)(1) provides that for purposes of Title 26 "The term 'person' shall be construed to mean and include an individual, a trust, (...) or corporation." However, 42 USC § 2000e(a) provides that for purposes of the subchapter it belongs to "The term 'person' includes one or more individuals, governments, (...) or receivers." Thus, the context is not identical across the corpus.

This work's goal is to inform professionals who read a section of legal text about the continuous and contextually consistent information unit the section belongs to. We coin such a unit a "root context"; it commonly represents an individual law on a specific topic. Root context is used where a codified corpus' hierarchical structure does not designate a single level for individual laws. Our root context method is modeled to emulate the "contextualization process", where experienced readers use references in the text to help contextualize what they are reading. The contextualization process can be broken down into three steps: 1) The reader notices that the definition of 'person' is not in the current section. 2) The reader needs to find the definition of 'person' that is applicable to the current section. 3) Once the definition of 'person' is found, the reader needs to understand which other sections the definition applies to. For instance, Figure 1 shows Chapter 1 as a root context. If the reader is reading Section 12, she might find a mention of the word 'person'. From our algorithm the reader is informed that the definition of 'person' is applicable to Chapter 1, which encompasses Section 1 through Section 25.

One challenge is to optimize the existing hierarchy within legal documents. For instance, the USC has 53 titles that are broken down into lower levels of hierarchy such as chapters, parts, etc. We address this challenge by combining the natural language text and the existing document hierarchy to find higher-level logical groupings that are not present in the original document hierarchy (we call these groupings "root contexts"). To achieve this, our method uses NLP techniques to extract hierarchy references from the text and automatically builds a hierarchy-reference graph (an example is depicted in Figure 1, with hierarchy references shown as arrows). We then create an algorithm that follows the graph references to automatically identify a point in the hierarchy that is a root context.

We evaluate our approach on five U.S. law corpora (one federal and four state corpora). To build our test datasets, we have a domain expert annotate every root context in each of the corpora. In our experiments, we compare these annotations with the predictions of our algorithm and find that our method achieves near perfect performance on our state corpora and high precision on our federal corpus. Summarizing, our work makes the following contributions:
(1) We introduce and define the novel problem of root context identification.
(2) We combine NLP and data mining techniques to develop novel methodology to identify root contexts.
(3) We show the effectiveness of our approach on five corpora.

2 RELATED WORK
The identification of information units is common in library and information sciences [18] and web information retrieval [12, 16, 19]. There, the purpose is to find a set of documents or articles that form a logically and contextually coherent unit. In the legal domain, it is also common to create some sort of abstraction of legal text.
Mostly this is achieved through information extraction, such as text summarization [9, 11], argument mining [13, 17], named entity extraction [6, 7], citation resolution [15] and visualization [10]. Combining these information extraction techniques as building blocks, another line of work builds structured knowledge resources with the intent of abstracting and logically organizing legal content [8, 14]. Our work bridges the gap between information units in the sense of information retrieval, where we combine content on the logical level, and the legal domain. In contrast to other work in the legal domain, we leverage information units as a concept for guiding the user when navigating legal text. Rather than extracting content from an existing document collection, we augment the data with contextually consistent information units, which we call root contexts. Thus, root context is a novel application of the concept of information units to organize large amounts of semi-structured legal text.

3 PROBLEM DEFINITION
Definition 3.1. Hierarchy instance: A concrete instantiation of a part of the hierarchy, e.g. "Title 12", denoted as T12. Each hierarchy instance is the union of all its sub-parts. For example, T12 is made up of chapters C1, ..., C18, which transitively are made up of subchapters SCi, etc.

Definition 3.2. Hierarchy level: The union of all hierarchy instances for a certain level. For instance, the "title" hierarchy level T of the USC encompasses all 53 title instances Ti: (T1, T2, ..., T53) ∈ T. The lowest level we consider in this work are sections S, which contain all the textual information.

Definition 3.3. Hierarchy branch: A unique identifier for a specific part in the corpus hierarchy. The hierarchy branch is a tuple of hierarchy instances that need to be "followed" from the hierarchy root. For example, the tuple ('USC', 'T12', 'C2', 'S221') identifies Section 221 in Chapter 2 ("Federal Reserve System"), within Title 12 ("Banks and Banking") in the United States Code.

Definition 3.4. Hierarchy reference: The hierarchy branch of a reference made to the hierarchy in legal text. For example, if the text in ('USC', 'T12', 'C2', 'S12') references its chapter, then the resulting hierarchy reference is ('USC', 'T12', 'C2'). We define custom regular expressions to identify hierarchy references in the text.

Definition 3.5. Hierarchy-reference graph: A weighted, directed graph where the nodes are hierarchy branches and the edges are hierarchy references between them. The edge weight is the number of references between a pair of nodes.

Definition 3.6. Root context: A specific hierarchy instance which makes up a contextually consistent information unit.

Definition 3.7. Root context identification: Given a document collection C (i.e., a law corpus), extract all hierarchy references E from the text and combine them with the existing document hierarchy V to form a hierarchy-reference graph G = (V, E, w). More specifically, V is the set of all hierarchy branches, E ⊆ {(x, y) | (x, y) ∈ V²} are all directed hierarchy references, and w: E → R are the edge weights. The goal is to find the set R ⊆ V of all hierarchy branches (i.e., nodes in the graph) that are root contexts.

4 APPROACH
Our approach consists of three phases: 1) extract hierarchy references and build the hierarchy-reference graph, 2) find root context indicators within definitions and purposes sections, and 3) perform multiple hops, always following the graph's highest-weight edges, until a root context is identified. Algorithm 1 describes our methodology for identifying root contexts. The input to the algorithm is a document collection C and a set of all hierarchy branches V. The output is a set of root contexts R.

Algorithm 1: Root Context Identification
  Data: C, a document collection (i.e., a law corpus); V, a set of all hierarchy branches
  Result: R, a set of all root contexts with R ⊆ V
  begin
    G = (V, E, w) ← extract hierarchy references and build graph
    D ← identify hierarchy references in definitions and purposes sections
    R ← ∅
    MAX_HOPS ← 10
    for v ∈ V do
      hops ← 0
      cur_v ← v
      while hops < MAX_HOPS do
        hops ← hops + 1
        if cur_v ∈ D then
          R ← R ∪ {cur_v}
          break
        counts ← ∅
        for (x, y) ∈ E with x = cur_v do
          counts ← counts ∪ {(y, w((x, y)))}
        if |counts| > 0 then
          max_y ← arg max over (y, w((x, y))) ∈ counts
          if cur_v = max_y then
            R ← R ∪ {cur_v}
            break
          cur_v ← max_y
        else
          cur_v ← parent(cur_v)

Build hierarchy-reference graph (G = (V, E, w)): We start with the nodes of the graph V, which are all hierarchy branches in a legal corpus. Because only the section hierarchy level S includes textual content, we start by investigating the text of all Si ∈ S. For each Si, we extract all hierarchy references by looking for textual references to parts of the hierarchy. We use a set of regular expressions to find these references in the text (e.g., "this ⟨hierarchy level⟩"). For instance, if the legal text in ('USC', 'T12', 'C2', 'S12') says "this chapter", we extract ('USC', 'T12', 'C2'). We then add an edge between all distinct hierarchy references and the current Si, where the edge weight is the sum of the hierarchy references for the edge's target node. Once this step is completed for S, we begin a count aggregation step at each hierarchy level, moving "upwards" in the hierarchy. In our running example, we move from S to the chapter hierarchy level C. For each Ci ∈ C, we aggregate all hierarchy references that go from Ci to other parts in the hierarchy. For instance, we aggregate all hierarchy references to T12 within the hierarchy instance of C2 into a single edge from ('USC', 'T12', 'C2') to ('USC', 'T12'), with the combined edge weight of all its children that point to T12.
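The graph-construction and count-aggregation steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the single regular expression, the LEVEL_DEPTH mapping, and the toy section texts are simplifying assumptions standing in for the paper's custom extraction rules.

```python
import re
from collections import defaultdict

# A hierarchy branch is modeled as a tuple such as ('USC', 'T12', 'C2', 'S12').
# REFERENCE_PATTERN is a simplified stand-in for the paper's custom regular
# expressions; it only catches self-references such as "this chapter".
REFERENCE_PATTERN = re.compile(r"this (title|chapter|subchapter|part)", re.IGNORECASE)

# Hypothetical depth of each hierarchy level inside a branch tuple
# (index 0 is the corpus, index 1 the title, and so on).
LEVEL_DEPTH = {"title": 2, "chapter": 3, "subchapter": 4, "part": 5}

def build_graph(sections):
    """Build weighted hierarchy-reference edges from section texts.

    sections maps a section's branch tuple to its legal text; the result maps
    (source, target) branch pairs to the number of references between them.
    """
    weights = defaultdict(int)
    for branch, text in sections.items():
        for match in REFERENCE_PATTERN.finditer(text):
            depth = LEVEL_DEPTH[match.group(1).lower()]
            target = branch[:depth]         # "this chapter" -> ('USC', 'T12', 'C2')
            weights[(branch, target)] += 1  # edge weight = reference count
    return dict(weights)

def aggregate_up(weights, depth):
    """Aggregate edges one hierarchy level up, e.g. from sections to chapters.

    Children of the same instance are collapsed into a single source node and
    their edge weights are summed, mirroring the count aggregation step.
    """
    aggregated = defaultdict(int)
    for (source, target), weight in weights.items():
        aggregated[(source[:depth], target)] += weight
    return dict(aggregated)

# Toy corpus: two sections inside Chapter 2 of Title 12.
sections = {
    ("USC", "T12", "C2", "S12"): "For purposes of this chapter, the term person includes ...",
    ("USC", "T12", "C2", "S13"): "The definitions in this chapter apply. See also this title.",
}
graph = build_graph(sections)
chapter_graph = aggregate_up(graph, depth=3)
```

With both toy sections referencing "this chapter", aggregation yields a self-loop on ('USC', 'T12', 'C2') of weight 2 — exactly the kind of node that the traversal step later flags as a root context.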
Identify root context indicators within definitions and purposes sections (D): References in certain sections that contain definitions or purposes carry more weight than regular sections, as these were authored with the intent of helping the reader with contextualization. Our extraction algorithm goes through all sections that contain either "definition" or "purpose" in their title. We define a set of regular expressions that extract the hierarchy references in these sections. Table 1 shows some examples, where "(\w+)" matches a reference to the hierarchy (e.g., "chapter"). If one of the regular expressions matches the text within a section, the hierarchy branch of the extracted reference is added to a set D.

Table 1: Regexes for finding definitions/purposes sections.
  Target       Regular Expression
  definitions  this (\w+) (?:describes|sets forth|governs|contains)
  definitions  the following definitions apply to this (\w+)
  purposes     the purposes? of (?:the )?.*this (\w+) (?:are|include|is)
  purposes     this (\w+) sets forth

Traverse the hierarchy-reference graph to generate R: Starting at every node in the graph, we perform multiple hops (in our case 10) along the graph, always following the outgoing edge with the highest weight. If a node is found that is in D, it is automatically added as a root context¹. If a node's highest-weighted outgoing edge points to itself, then we also consider the node a root context. If a node has no outgoing edges, we move up one level in the hierarchy, to the node's parent, and continue the procedure.

¹We experimented with different weighting schemes for nodes that are in D; however, we found that "overriding" the edge weights for nodes in D works best.

5 EXPERIMENTS
5.1 Experimental Setup
Our experimental evaluation aims at measuring the efficacy of the root context identification approach outlined in Section 4. For our experiments, we run our algorithm and compare its predictions to expert annotations. We experiment with three full law corpora: United States Code (USC) [5], California Law (CACL) [1] and Texas Statutes (TXST) [4], and two partial law corpora: Illinois Compiled Statutes (ILCS) [3] (10 out of 68 chapters covered) and Consolidated Laws of New York (NYCL) [2] (7 out of 92 chapters covered). For each corpus, we extract all of its textual contents and hierarchy from the web pages available online.

For each hierarchy branch in V, we have a domain expert annotate whether it is a root context ("1" label) or not ("0" label). Table 2 shows the statistics for each dataset: its total number of nodes ("Total"), the number of annotated root contexts ("Number Root Contexts") and the percentage of root contexts out of all nodes in the graph ("%").

Table 2: Statistics of datasets.
  Dataset  Total    Number Root Contexts  %
  USC      166,086  3,040                 1.83
  CACL     177,862  2,887                 1.62
  TXST     239,259  4,419                 1.84
  ILCS     19,088   800                   4.19
  NYCL     4,201    306                   7.28

Since our experimental setup is equivalent to a classification problem, we report the F1-score, precision and recall of the class under consideration (i.e., the "1" label), as well as classification accuracy. In this setting, false positives are instances that our method identifies as root contexts but that the expert annotated as not being root contexts. False negatives are root contexts that were identified by the expert but not found by our method.

5.2 Experimental Results
Table 3 shows the results of our method on the different datasets. We find that our method generally achieves high precision (≥ 0.95), which means that if our method finds a root context, one can be confident that it actually is a root context. Recall is also high for the state corpora, meaning that our method finds almost all root contexts on the state level, but it is lower for our federal corpus. In our manual analysis we find that the reason for the lower recall on the federal level is that there are inherent differences in how the hierarchy is organized within the titles of the USC. We see improving recall on the federal level as an opportunity for future research. Accuracy is close to 1 for all corpora, which is not surprising since the number of root contexts is much smaller than the number of nodes in the graph. Summarizing, we find that our method achieves near perfect performance on the state corpora and high precision on the USC.

Table 3: Classification results of root context identification.
  Dataset  F1    Precision  Recall  Accuracy
  USC      0.71  0.95       0.56    0.99
  CACL     0.97  0.98       0.97    1.00
  TXST     0.95  0.98       0.91    1.00
  ILCS     0.98  0.99       0.97    0.99
  NYCL     0.92  1.00       0.85    0.99

6 CONCLUSION AND IMPACT
We presented the problem of finding contextually consistent information units to assist professionals when reading legal text and developed novel methodology to find these units. We evaluate our method and find that it achieves high precision and F1 score on multiple datasets. The high accuracy of our method indicates that the task is tractable and that the proposed method works well. Since the method is unsupervised, it does not require any manual work and can thus be applied broadly to similar problems. Naturally, we may further improve performance by applying supervised learning, with our method used as one feature combined with other features, especially text-based features; we leave this for future work. While we find our method to be effective, there are instances where root contexts cannot be identified due to the lack of hierarchy references. To address this in the future, we envision finding root context indicators by looking at the evolution of legal text over time and studying the differences between versions of the law.

This work further aids Regology's² machine learning framework. For instance, the root context concept has been successfully utilized to build a keyword extraction algorithm, to increase the performance of existing information retrieval components (e.g., grouping search results, assessing law changes, differentiating between new and amending laws), and to extract topics.

²https://regology.com
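The traversal described in Section 4 (Algorithm 1) can be sketched in Python. This is an illustrative re-implementation under simplifying assumptions — the toy graph, the `parent` helper, and arbitrary tie-breaking among equal-weight edges are hypothetical — not the authors' code.

```python
MAX_HOPS = 10  # hop limit used in Algorithm 1

def find_root_contexts(nodes, weights, indicators, parent):
    """Sketch of Algorithm 1's traversal over the hierarchy-reference graph.

    nodes:      iterable of hierarchy branches (graph nodes)
    weights:    dict mapping (source, target) branch pairs to edge weights
    indicators: the set D of branches referenced by definitions/purposes sections
    parent:     function returning a branch's parent branch, or None at the root
    """
    roots = set()
    for start in nodes:
        current, hops = start, 0
        while current is not None and hops < MAX_HOPS:
            hops += 1
            if current in indicators:      # indicator nodes override edge weights
                roots.add(current)
                break
            outgoing = {t: w for (s, t), w in weights.items() if s == current}
            if outgoing:
                best = max(outgoing, key=outgoing.get)  # heaviest outgoing edge
                if best == current:        # self-loop wins: a root context
                    roots.add(current)
                    break
                current = best             # hop along the heaviest edge
            else:
                current = parent(current)  # dead end: climb the hierarchy

    return roots

# Toy graph: Section 12 points at its chapter, and the chapter's heaviest
# edge is a self-loop, so the chapter is identified as the root context.
s12 = ("USC", "T12", "C2", "S12")
c2 = ("USC", "T12", "C2")
weights = {(s12, c2): 3, (c2, c2): 5}
parent = lambda branch: branch[:-1] if len(branch) > 1 else None
roots = find_root_contexts([s12, c2], weights, set(), parent)
```

Passing a non-empty indicator set short-circuits the walk as soon as an indicator node is reached, mirroring the "overriding" behavior described in footnote 1.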
REFERENCES
[1] [n.d.]. California Law. http://leginfo.legislature.ca.gov/faces/codes.xhtml
[2] [n.d.]. Consolidated Laws of New York. https://www.nysenate.gov/legislation/laws/CONSOLIDATED
[3] [n.d.]. Illinois Compiled Statutes. http://www.ilga.gov/legislation/ilcs/ilcs.asp
[4] [n.d.]. Texas Statutes. https://statutes.capitol.texas.gov/StatuteCodes.aspx
[5] [n.d.]. United States Code. http://uscode.house.gov/download/download.shtml
[6] Cristian Cardellino, Milagro Teruel, Laura Alonso Alemany, and Serena Villata. 2017. A low-cost, high-coverage legal named entity recognizer, classifier and linker. In Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law. 9–18.
[7] Christopher Dozier, Ravikumar Kondadadi, Marc Light, Arun Vachher, Sriharsha Veeramachaneni, and Ramdev Wudali. 2010. Named entity recognition and resolution in legal text. In Semantic Processing of Legal Texts. Springer, 27–43.
[8] Enrico Francesconi, Simonetta Montemagni, Wim Peters, and Daniela Tiscornia. 2010. Integrating a bottom–up and top–down methodology for building semantic resources for the multilingual legal domain. In Semantic Processing of Legal Texts. Springer, 95–121.
[9] Filippo Galgani, Paul Compton, and Achim Hoffmann. 2012. Combining different summarization techniques for legal text. In Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data. Association for Computational Linguistics, 115–123.
[10] Aikaterini-Lida Kalouli, Leo Vrana, Vigile Marie Fabella, Luna Bellani, and Annette Hautli-Janisz. 2018. CoUSBi: A Structured and Visualized Legal Corpus of US State Bills. In Proceedings of the LREC 2018 Workshop on Language Resources and Technologies for the Legal Knowledge Graph. Miyazaki, Japan.
[11] Ambedkar Kanapala, Sukomal Pal, and Rajendra Pamula. 2019. Text summarization from legal documents: a survey. Artificial Intelligence Review 51, 3 (2019), 371–402.
[12] Wen-Syan Li, K. Selçuk Candan, Quoc Vu, and Divyakant Agrawal. 2001. Retrieving and organizing web pages by "information unit". In Proceedings of the 10th International Conference on World Wide Web. 230–244.
[13] Marie-Francine Moens, Erik Boiy, Raquel Mochales Palau, and Chris Reed. 2007. Automatic detection of arguments in legal texts. In Proceedings of the 11th International Conference on Artificial Intelligence and Law. 225–230.
[14] Georg Rehm, Julian Moreno Schneider, Jorge Gracia, Artem Revenko, Victor Mireles, Maria Khvalchik, Ilan Kernerman, Andis Lagzdins, Mārcis Pinnis, Artus Vasilevskis, et al. 2019. Developing and orchestrating a portfolio of natural language processing and document curation services. In Proceedings of the Natural Legal Language Processing Workshop 2019. 55–66.
[15] Robert Shaffer and Stephen Mayhew. 2019. Legal Linking: Citation Resolution and Suggestion in Constitutional Law. In Proceedings of the Natural Legal Language Processing Workshop 2019. 39–44.
[16] Keishi Tajima, Kenji Hatano, Takeshi Matsukura, Ryoichi Sano, and Katsumi Tanaka. 1999. Discovery and Retrieval of Logical Information Units in Web. In WOWS. 13–23.
[17] Adam Wyner, Raquel Mochales-Palau, Marie-Francine Moens, and David Milward. 2010. Approaches to text mining arguments from legal cases. In Semantic Processing of Legal Texts. Springer, 60–79.
[18] Liangzhi Yu, Zhenjia Fan, and Anyi Li. 2019. A hierarchical typology of scholarly information units: based on a deduction-verification study. Journal of Documentation (2019).
[19] ChengXiang Zhai and John Lafferty. 2006. A risk minimization framework for information retrieval. Information Processing & Management 42, 1 (2006), 31–55.