Finding Contextually Consistent Information Units in Legal Text

Dominic Seyler (University of Illinois at Urbana-Champaign, dseyler2@illinois.edu)∗
Paul Bruin (Regology, paul.bruin@regology.com)
Pavan Bayyapu (Regology, pavan.bayyapu@regology.com)
ChengXiang Zhai (University of Illinois at Urbana-Champaign, czhai@illinois.edu)

∗This work was done during an internship at Regology.

ABSTRACT
Terms in the laws of a legislature can be highly contextual, especially in corpora of codified laws and regulations, where the reader has to be aware of the correct context but the corpus lacks a single level of hierarchy. The goal of this work is to assist professionals when reading legal text within a codified corpus by finding contextually consistent information units. To achieve this, we combine NLP and data mining techniques to develop novel methodology that finds these information units in an unsupervised manner. Our method draws on expert experience and is modeled to emulate the "contextualization process" of experienced readers of legal content. We experimentally evaluate our method by comparing it to multiple expert-annotated datasets and find that it achieves near perfect performance on four state corpora and high precision on one federal corpus.

KEYWORDS
information units, logical document organization, legal text mining

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). NLLP @ KDD 2020, August 24th, San Diego, US. © 2020 Copyright held by the owner/author(s).

[Figure 1 graphic: a hierarchy-reference graph over title T1, chapters C1 and C2, subchapters SC1–SC4, and section ranges S1–S15, S20–S25, S30–S35, S40–S49, with weighted reference arrows.]
Figure 1: An example of a hierarchy-reference graph. Identified root contexts are highlighted with ★.

1 INTRODUCTION
Within a corpus, the same term can have different meanings. For instance, in the United States Code (USC), 26 USC § 7701(a)(1) provides that for purposes of Title 26 "The term 'person' shall be construed to mean and include an individual, a trust, (...) or corporation." However, 42 USC § 2000e(a) provides that for purposes of the subchapter it belongs to "The term 'person' includes one or more individuals, governments, (...) or receivers." Thus, the context is not identical across the corpus.

This work's goal is to inform professionals who read a section of legal text about the continuous and contextually consistent information unit the section belongs to. We coin such a unit a "root context"; it commonly represents an individual law on a specific topic. Root context is used where a codified corpus' hierarchical structure does not designate a single level for individual laws. Our root context method is modeled to emulate the "contextualization process", where experienced readers use references in the text to help contextualize what they are reading. The contextualization process can be broken down into three steps: 1) The reader notices that the definition of 'person' is not in the current section. 2) The reader needs to find the definition of 'person' that is applicable to the current section. 3) Once the definition of 'person' is found, the reader needs to understand which other sections the definition applies to. For instance, Figure 1 shows Chapter 1 as a root context. If the reader is reading Section 12, she might find a mention of the word 'person'. From our algorithm the reader is informed that the definition of 'person' is applicable to Chapter 1, which encompasses Section 1 through Section 25.

One challenge is to optimize the existing hierarchy within legal documents. For instance, the USC has 53 titles that are broken down into lower levels of hierarchy such as chapters, parts, etc. We address this challenge by combining the natural language text and the existing document hierarchy to find higher-level logical groupings that are not present in the original document hierarchy (we call these groupings "root contexts"). To achieve this, our method uses NLP techniques to extract hierarchy references from the text and automatically builds a hierarchy-reference graph (an example is depicted in Figure 1, with hierarchy references shown as arrows). We then create an algorithm that follows the graph references to automatically identify a point in the hierarchy that is a root context.

We evaluate our approach on five U.S. law corpora (one federal and four state corpora). To build our test datasets, we have a domain expert annotate every root context in each of the corpora. In our experiments, we compare these annotations with the predictions of our algorithm and find that our method achieves near perfect performance on our state corpora and high precision on our federal corpus. Summarizing, our work makes the following contributions:
(1) We introduce and define the novel problem of root context identification.
(2) We combine NLP and data mining techniques to develop novel methodology to identify root contexts.
(3) We show the effectiveness of our approach on five corpora.

2 RELATED WORK
The identification of information units is common in library and information sciences [18] and web information retrieval [12, 16, 19]. There, the purpose is to find a set of documents or articles that form a logically and contextually coherent unit. In the legal domain, it is also common to create some sort of abstraction of legal text.
Mostly this is achieved through information extraction, such as text summarization [9, 11], argument mining [13, 17], named entity extraction [6, 7], citation resolution [15] and visualization [10]. Combining these information extraction techniques as building blocks, another line of work builds structured knowledge resources with the intent of abstracting and logically organizing legal content [8, 14]. Our work bridges the gap between information units in the sense of information retrieval, where we combine content on the logical level, and the legal domain. In contrast to other work in the legal domain, we leverage information units as a concept for guiding the user when navigating legal text. Rather than extracting content from an existing document collection, we augment the data with contextually consistent information units, which we call root contexts. Thus, root context is a novel application of the concept of information units to organize large amounts of semi-structured legal text.

3 PROBLEM DEFINITION
Definition 3.1. Hierarchy instance: A concrete instantiation of a part of the hierarchy, e.g. "Title 12", denoted as T12. Each hierarchy instance is the union of all its sub-parts. For example, T12 is made up of chapters C1, ..., C18, which transitively are made up of subchapters SCi, etc.

Definition 3.2. Hierarchy level: The union of all hierarchy instances for a certain level. For instance, the "title" hierarchy level T of the USC encompasses all 53 title instances Ti: (T1, T2, ..., T53) ∈ T. The lowest level we consider in this work are sections S, which contain all the textual information.

Definition 3.3. Hierarchy branch: A unique identifier for a specific part in the corpus hierarchy. The hierarchy branch is a tuple of hierarchy instances that need to be "followed" from the hierarchy root. For example, the tuple ('USC', 'T12', 'C2', 'S221') identifies Section 221 in Chapter 2 ("Federal Reserve System"), within Title 12 ("Banks and Banking") in the United States Code.

Definition 3.4. Hierarchy reference: The hierarchy branch of a reference made to the hierarchy in legal text. For example, if the text in ('USC', 'T12', 'C2', 'S12') references its chapter, then the resulting hierarchy reference is ('USC', 'T12', 'C2'). We define custom regular expressions to identify hierarchy references in the text.

Definition 3.5. Hierarchy-reference graph: A weighted, directed graph where the nodes are hierarchy branches and the edges are hierarchy references between them. The edge weight is the number of references between a pair of nodes.

Definition 3.6. Root context: A specific hierarchy instance which makes up a contextually consistent information unit.

Definition 3.7. Root context identification: Given a document collection C (i.e., a law corpus), extract all hierarchy references E from the text and combine them with the existing document hierarchy V to form a hierarchy-reference graph G = (V, E, w). More specifically, V is the set of all hierarchy branches, E ⊆ {(x, y) | (x, y) ∈ V²} are all directed hierarchy references, and w: E → R are the edge weights. The goal is to find the set R ⊆ V of all hierarchy branches (i.e., nodes in the graph) that are root contexts.

4 APPROACH
Our approach consists of three phases: 1) extract hierarchy references and build the hierarchy-reference graph, 2) find root context indicators within definitions and purposes sections, and 3) perform multiple hops, always following the graph's highest-weight edges, until a root context is identified. Algorithm 1 describes our methodology for identifying root contexts. The input to the algorithm is a document collection C and a set of all hierarchy branches V. The output is a set of root contexts R.

Algorithm 1: Root Context Identification
  Data: C, a document collection (i.e., a law corpus); V, a set of all hierarchy branches
  Result: R, a set of all root contexts with R ⊆ V
  begin
    G = (V, E, w) ← extract hierarchy references and build graph
    D ← identify hierarchy references in definitions and purposes sections
    R ← ∅
    MAX_HOPS ← 10
    for v ∈ V do
      hops ← 0
      cur_v ← v
      while hops < MAX_HOPS do
        hops ← hops + 1
        if cur_v ∈ D then
          R ← R ∪ {cur_v}
          break
        counts ← ∅
        for (x, y) ∈ E with x = cur_v do
          counts ← counts ∪ {(y, w((x, y)))}
        if |counts| > 0 then
          max_y ← arg max over (y, w((x, y))) ∈ counts
          if cur_v = max_y then
            R ← R ∪ {cur_v}
            break
          cur_v ← max_y
        else
          cur_v ← parent(cur_v)

Build hierarchy-reference graph (G = (V, E, w)): We start with the nodes of the graph V, which are all hierarchy branches in a legal corpus. Because only the section hierarchy level S includes textual content, we start by investigating the text of all Si ∈ S. For each Si, we extract all hierarchy references by looking for textual references to parts of the hierarchy. We use a set of regular expressions to find these references in the text (e.g., "this ⟨hierarchy level⟩"). For instance, if the legal text in ('USC', 'T12', 'C2', 'S12') says "this chapter", we extract ('USC', 'T12', 'C2'). We then add an edge between all distinct hierarchy references and the current Si, where the edge weight is the sum of the hierarchy references for the edge's target node. Once this step is completed for S, we begin a count aggregation step at each hierarchy level, moving "upwards" in the hierarchy. In our running example, we move from S to the chapter hierarchy level C. For each Ci ∈ C, we aggregate all hierarchy references that go from Ci to other parts in the hierarchy. For instance, we aggregate all hierarchy references to T12 within the hierarchy instance of C2 into a single edge from ('USC', 'T12', 'C2') to ('USC', 'T12'), with the combined edge weight of all its children that point to T12.
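The graph-construction and count-aggregation steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the single regular expression, the LEVEL_DEPTH mapping, and the toy section texts are simplifying assumptions standing in for the paper's custom extraction rules.

```python
import re
from collections import defaultdict

# A hierarchy branch is modeled as a tuple such as ('USC', 'T12', 'C2', 'S12').
# REFERENCE_PATTERN is a simplified stand-in for the paper's custom regular
# expressions; it only catches self-references such as "this chapter".
REFERENCE_PATTERN = re.compile(r"this (title|chapter|subchapter|part)", re.IGNORECASE)

# Hypothetical depth of each hierarchy level inside a branch tuple
# (index 0 is the corpus, index 1 the title, and so on).
LEVEL_DEPTH = {"title": 2, "chapter": 3, "subchapter": 4, "part": 5}

def build_graph(sections):
    """Build weighted hierarchy-reference edges from section texts.

    sections maps a section's branch tuple to its legal text; the result maps
    (source, target) branch pairs to the number of references between them.
    """
    weights = defaultdict(int)
    for branch, text in sections.items():
        for match in REFERENCE_PATTERN.finditer(text):
            depth = LEVEL_DEPTH[match.group(1).lower()]
            target = branch[:depth]         # "this chapter" -> ('USC', 'T12', 'C2')
            weights[(branch, target)] += 1  # edge weight = reference count
    return dict(weights)

def aggregate_up(weights, depth):
    """Aggregate edges one hierarchy level up, e.g. from sections to chapters.

    Children of the same instance are collapsed into a single source node and
    their edge weights are summed, mirroring the count aggregation step.
    """
    aggregated = defaultdict(int)
    for (source, target), weight in weights.items():
        aggregated[(source[:depth], target)] += weight
    return dict(aggregated)

# Toy corpus: two sections inside Chapter 2 of Title 12.
sections = {
    ("USC", "T12", "C2", "S12"): "For purposes of this chapter, the term person includes ...",
    ("USC", "T12", "C2", "S13"): "The definitions in this chapter apply. See also this title.",
}
graph = build_graph(sections)
chapter_graph = aggregate_up(graph, depth=3)
```

With both toy sections referencing "this chapter", aggregation yields a self-loop on ('USC', 'T12', 'C2') of weight 2 — exactly the kind of node that the traversal step later flags as a root context.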
Identify root context indicators within definitions and purposes sections (D): References in certain sections that contain definitions or purposes carry more weight than regular sections, as these were authored with the intent of helping the reader with contextualization. Our extraction algorithm goes through all sections that contain either "definition" or "purpose" in their title. We define a set of regular expressions that extract the hierarchy references in these sections. Table 1 shows some examples, where "(\w+)" matches a reference to the hierarchy (e.g., "chapter"). If one of the regular expressions matches the text within a section, the hierarchy branch of the extracted reference is added to a set D.

Table 1: Regexes for finding definitions/purposes sections.
  Target       Regular Expression
  definitions  this (\w+) (?:describes|sets forth|governs|contains)
  definitions  the following definitions apply to this (\w+)
  purposes     the purposes? of (?:the )?.*this (\w+) (?:are|include|is)
  purposes     this (\w+) sets forth

Traverse the hierarchy-reference graph to generate R: Starting at every node in the graph, we perform multiple hops (in our case 10) along the graph, always following the outgoing edge with the highest weight. If a node is found that is in D, it is automatically added as a root context¹. If a node's highest-weighted outgoing edge points to itself, then we also consider the node a root context. If a node has no outgoing edges, we move up one level in the hierarchy, to the node's parent, and continue the procedure.

¹We experimented with different weighting schemes for nodes that are in D; however, we found that "overriding" the edge weights for nodes in D works best.

5 EXPERIMENTS
5.1 Experimental Setup
Our experimental evaluation aims at measuring the efficacy of the root context identification approach outlined in Section 4. For our experiments, we run our algorithm and compare its predictions to expert annotations. We experiment with three full law corpora: United States Code (USC) [5], California Law (CACL) [1] and Texas Statutes (TXST) [4], and two partial law corpora: Illinois Compiled Statutes (ILCS) [3] (10 out of 68 chapters covered) and Consolidated Laws of New York (NYCL) [2] (7 out of 92 chapters covered). For each corpus, we extract all of its textual contents and hierarchy from the web pages available online.

For each hierarchy branch in V, we have a domain expert annotate whether it is a root context ("1" label) or not ("0" label). Table 2 shows the statistics for each dataset: its total number of nodes ("Total"), the number of annotated root contexts ("Number Root Contexts") and the percentage of root contexts out of all nodes in the graph ("%").

Table 2: Statistics of datasets.
  Dataset  Total    Number Root Contexts  %
  USC      166,086  3,040                 1.83
  CACL     177,862  2,887                 1.62
  TXST     239,259  4,419                 1.84
  ILCS     19,088   800                   4.19
  NYCL     4,201    306                   7.28

Since our experimental setup is equivalent to a classification problem, we report the F1-score, precision and recall of the class under consideration (i.e., the "1" label), as well as classification accuracy. In this setting, false positives are instances that our method identifies as root contexts but that the expert annotated as not being root contexts. False negatives are root contexts that were identified by the expert but not found by our method.

5.2 Experimental Results
Table 3 shows the results of our method on the different datasets. We find that our method generally achieves high precision (≥ 0.95), which means that if our method finds a root context, one can be confident that it actually is a root context. Recall is also high for the state corpora, meaning that our method finds almost all root contexts on the state level, but it is lower for our federal corpus. In our manual analysis we find that the reason for the lower recall on the federal level is that there are inherent differences in how the hierarchy is organized within the titles of the USC. We see improving recall on the federal level as an opportunity for future research. Accuracy is close to 1 for all corpora, which is not surprising since the number of root contexts is much smaller than the number of nodes in the graph. Summarizing, we find that our method achieves near perfect performance on the state corpora and high precision on the USC.

Table 3: Classification results of root context identification.
  Dataset  F1    Precision  Recall  Accuracy
  USC      0.71  0.95       0.56    0.99
  CACL     0.97  0.98       0.97    1.00
  TXST     0.95  0.98       0.91    1.00
  ILCS     0.98  0.99       0.97    0.99
  NYCL     0.92  1.00       0.85    0.99

6 CONCLUSION AND IMPACT
We presented the problem of finding contextually consistent information units to assist professionals when reading legal text and developed novel methodology to find these units. We evaluate our method and find that it achieves high precision and F1 score on multiple datasets. The high accuracy of our method indicates that the task is tractable and that the proposed method works well. Since the method is unsupervised, it does not require any manual work and can thus be applied broadly to similar problems. Naturally, we may further improve performance by applying supervised learning, with our method used as one feature combined with other features, especially text-based features; we leave this for future work. While we find our method to be effective, there are instances where root contexts cannot be identified due to the lack of hierarchy references. To address this in the future, we envision finding root context indicators by looking at the evolution of legal text over time and studying the differences between versions of the law.

This work further aids Regology's² machine learning framework. For instance, the root context concept has been successfully utilized to build a keyword extraction algorithm, to increase the performance of existing information retrieval components (e.g., grouping search results, assessing law changes, differentiating between new and amending laws), and to extract topics.

²https://regology.com
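The traversal described in Section 4 (Algorithm 1) can be sketched in Python. This is an illustrative re-implementation under simplifying assumptions — the toy graph, the `parent` helper, and arbitrary tie-breaking among equal-weight edges are hypothetical — not the authors' code.

```python
MAX_HOPS = 10  # hop limit used in Algorithm 1

def find_root_contexts(nodes, weights, indicators, parent):
    """Sketch of Algorithm 1's traversal over the hierarchy-reference graph.

    nodes:      iterable of hierarchy branches (graph nodes)
    weights:    dict mapping (source, target) branch pairs to edge weights
    indicators: the set D of branches referenced by definitions/purposes sections
    parent:     function returning a branch's parent branch, or None at the root
    """
    roots = set()
    for start in nodes:
        current, hops = start, 0
        while current is not None and hops < MAX_HOPS:
            hops += 1
            if current in indicators:      # indicator nodes override edge weights
                roots.add(current)
                break
            outgoing = {t: w for (s, t), w in weights.items() if s == current}
            if outgoing:
                best = max(outgoing, key=outgoing.get)  # heaviest outgoing edge
                if best == current:        # self-loop wins: a root context
                    roots.add(current)
                    break
                current = best             # hop along the heaviest edge
            else:
                current = parent(current)  # dead end: climb the hierarchy

    return roots

# Toy graph: Section 12 points at its chapter, and the chapter's heaviest
# edge is a self-loop, so the chapter is identified as the root context.
s12 = ("USC", "T12", "C2", "S12")
c2 = ("USC", "T12", "C2")
weights = {(s12, c2): 3, (c2, c2): 5}
parent = lambda branch: branch[:-1] if len(branch) > 1 else None
roots = find_root_contexts([s12, c2], weights, set(), parent)
```

Passing a non-empty indicator set short-circuits the walk as soon as an indicator node is reached, mirroring the "overriding" behavior described in footnote 1.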
REFERENCES
[1] [n.d.]. California Law. http://leginfo.legislature.ca.gov/faces/codes.xhtml
[2] [n.d.]. Consolidated Laws of New York. https://www.nysenate.gov/legislation/laws/CONSOLIDATED
[3] [n.d.]. Illinois Compiled Statutes. http://www.ilga.gov/legislation/ilcs/ilcs.asp
[4] [n.d.]. Texas Statutes. https://statutes.capitol.texas.gov/StatuteCodes.aspx
[5] [n.d.]. United States Code. http://uscode.house.gov/download/download.shtml
[6] Cristian Cardellino, Milagro Teruel, Laura Alonso Alemany, and Serena Villata. 2017. A low-cost, high-coverage legal named entity recognizer, classifier and linker. In Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law. 9–18.
[7] Christopher Dozier, Ravikumar Kondadadi, Marc Light, Arun Vachher, Sriharsha Veeramachaneni, and Ramdev Wudali. 2010. Named entity recognition and resolution in legal text. In Semantic Processing of Legal Texts. Springer, 27–43.
[8] Enrico Francesconi, Simonetta Montemagni, Wim Peters, and Daniela Tiscornia. 2010. Integrating a bottom–up and top–down methodology for building semantic resources for the multilingual legal domain. In Semantic Processing of Legal Texts. Springer, 95–121.
[9] Filippo Galgani, Paul Compton, and Achim Hoffmann. 2012. Combining different summarization techniques for legal text. In Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data. Association for Computational Linguistics, 115–123.
[10] Aikaterini-Lida Kalouli, Leo Vrana, Vigile Marie Fabella, Luna Bellani, and Annette Hautli-Janisz. 2018. CoUSBi: A Structured and Visualized Legal Corpus of US State Bills. In Proceedings of the LREC 2018 Workshop on Language Resources and Technologies for the Legal Knowledge Graph. Miyazaki, Japan.
[11] Ambedkar Kanapala, Sukomal Pal, and Rajendra Pamula. 2019. Text summarization from legal documents: a survey. Artificial Intelligence Review 51, 3 (2019), 371–402.
[12] Wen-Syan Li, K. Selçuk Candan, Quoc Vu, and Divyakant Agrawal. 2001. Retrieving and organizing web pages by "information unit". In Proceedings of the 10th International Conference on World Wide Web. 230–244.
[13] Marie-Francine Moens, Erik Boiy, Raquel Mochales Palau, and Chris Reed. 2007. Automatic detection of arguments in legal texts. In Proceedings of the 11th International Conference on Artificial Intelligence and Law. 225–230.
[14] Georg Rehm, Julian Moreno Schneider, Jorge Gracia, Artem Revenko, Victor Mireles, Maria Khvalchik, Ilan Kernerman, Andis Lagzdins, Mārcis Pinnis, Artus Vasilevskis, et al. 2019. Developing and orchestrating a portfolio of natural language processing and document curation services. In Proceedings of the Natural Legal Language Processing Workshop 2019. 55–66.
[15] Robert Shaffer and Stephen Mayhew. 2019. Legal Linking: Citation Resolution and Suggestion in Constitutional Law. In Proceedings of the Natural Legal Language Processing Workshop 2019. 39–44.
[16] Keishi Tajima, Kenji Hatano, Takeshi Matsukura, Ryoichi Sano, and Katsumi Tanaka. 1999. Discovery and Retrieval of Logical Information Units in Web. In WOWS. 13–23.
[17] Adam Wyner, Raquel Mochales-Palau, Marie-Francine Moens, and David Milward. 2010. Approaches to text mining arguments from legal cases. In Semantic Processing of Legal Texts. Springer, 60–79.
[18] Liangzhi Yu, Zhenjia Fan, and Anyi Li. 2019. A hierarchical typology of scholarly information units: based on a deduction-verification study. Journal of Documentation (2019).
[19] ChengXiang Zhai and John Lafferty. 2006. A risk minimization framework for information retrieval. Information Processing & Management 42, 1 (2006), 31–55.