Automated Directive Extraction from Policy Texts

Karl Branting, Jim Finegan, David Shin, Stacy Petersen
The MITRE Corporation, McLean, VA, USA
{lbranting, jfinegan, hshin, spetersen}@mitre.org

Carlos Balhana
Language Technology Lab, University of Cambridge, Cambridge, UK
ceb81@cam.ac.uk

Alex Lyte
The MITRE Corporation, Bedford, MA, USA
alyte@mitre.org

Craig Pfeifer
The MITRE Corporation, Ann Arbor, MI, USA
cpfeifer@mitre.org

In: Proceedings of the Workshop on Artificial Intelligence and the Administrative State (AIAS 2019), June 17, 2019, Montreal, QC, Canada. © 2019 for this paper by The MITRE Corporation. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published at http://ceur-ws.org

ABSTRACT

Federal agencies must comply with directives expressed in documents issued by authoritative sources elsewhere in the government. To automate identification of these directives, the ADEPT (Automated Directive Extraction from Policy Texts) system exploits the observation that directive sentences are usually characterized by deontic modality (e.g., "must", "shall", etc.), permitting the open-ended task of summarizing obligations to be reduced to a well-defined and circumscribed linguistic analysis task. ADEPT comprises a linearizer, which converts deeply nested sentences into a form that can be handled by standard parsers; a deontic sentence classifier trained on an annotated corpus of sentences drawn from representative policy documents; a semantic role analyzer; and other analytic tools for extracting and analyzing the deontic content of policy documents.

1 INTRODUCTION

Modern administrative states are regulated by statutes, regulations, and other authoritative legal sources that are expressed in complex, interconnected texts. Compliance with these rules is challenging for agencies, citizens, rule-drafters, and attorneys alike. For agencies, compliance requires understanding changes in federal laws, executive orders, and authoritative directives, policies, regulations, and standards. Simply identifying and summarizing these changes, which often originate from a multitude of sources, can be a burdensome drain on staff resources.

The diversity of authoritative sources imposing requirements of a given nature is typified by the proliferation of cybersecurity requirements on U.S. federal agencies. Directives can be expressed in Executive Orders, Office of Management and Budget (OMB) circulars and memoranda, Department of Homeland Security (DHS) Binding Operational Directives (BODs), National Institute of Standards and Technology (NIST) Federal Information Processing Standards (FIPS), and Special Publications (SPs). Each agency must devote staff to monitor and review multiple streams of publications to identify changes affecting their cybersecurity profile (i.e., policies, practices, procedures, standards, and/or guidance).

A similar monitoring task is required for all other areas within an agency where compliance is compulsory, such as privacy, health policy, and processing of sensitive information. An algorithmic process that automated the identification of sentences expressing obligations incumbent upon a given agency could significantly reduce the burden on staff having to review a large stream of documents. Such automated processes could provide agencies with early warnings of pending obligations, enabling them to better plan for implementation once the obligation is finalized.

A key observation of human performance on the document-monitoring task is that the summaries produced by staff typically focus on sentences that express obligations, i.e., that are characterized by deontic modality. This suggests that the tasks of monitoring and extracting directive sentences depend critically on the identification of such deontic sentences. We hypothesize that exploiting this observation will permit an important portion of the open-ended task of summarizing obligations to be reduced to a well-defined and circumscribed linguistic analysis task.

The remainder of this paper describes the design of a system for automated extraction of directives, ADEPT, and the evaluation of the critical deontic-sentence classification component. Section 2 presents examples of directives and describes the characteristics that distinguish directives from non-directives and different types of directives from one another. Section 3 discusses prior related work on modality classification, and the handling of nested directives, that is, sentences in which dependent clauses or sentential complements share a common root clause, is discussed in Section 4. Section 5 sets forth ADEPT's approach to identifying and classifying directive sentences, and Section 6 describes the use of semantic role labeling and frame instantiation to extract structured knowledge from sentences identified as directives. The implemented ADEPT architecture is described in Section 7, and Section 8 summarizes and outlines future efforts.

2 DIRECTIVE SENTENCES IN POLICY DOCUMENTS

ADEPT is based on an analysis of the work products of subject-matter experts engaged in monitoring federal policy documents originating from authoritative sources such as those listed in Section 1. Analysis of these work products revealed that directives typically consist of expressions of obligations on the part of an agency or other government entity to perform or refrain from some specified actions, such as:

(1) Agencies must establish performance goals.
(2) Agencies are required to provide narrative responses regarding their risk management decision process.
(3) Each agency business owner is directed to ensure that 3DES and RC4 ciphers are disabled on mail servers.
(4) Chief Information Officers are to submit a report within 180 days.
These directive sentences can be viewed as illocutionary [3] or performative texts [22] that make a given action compulsory for a given government entity (i.e., the agency or a holder of a role within the agency). Frequently, as in sentence 1 above, directive sentences use modal verbs, such as "must", "shall", "may", and "should", as auxiliaries [20]. However, sentences 2–4 illustrate that obligations can be expressed without the use of modal verbs.

In addition to these absolute, i.e., unqualified, sentences, there are two other types of sentences that are important for some, but not all, applications.

First, some directives are qualified in the sense of expressing either permission or weak necessity, as in the following two sentences:

(5) Senior executives may consider delaying awarding new financial assistance obligations (permission).
(6) Agencies should establish and report other meaningful performance indicators and goals (weak necessity).

Second, some sentences merely report an obligation created by a different document, rather than creating an obligation themselves, such as:

(7) Section 1 of the Executive Order requires agency heads to ensure appropriate risk management.

We term such sentences indirect obligation sentences.

We exclude from our set of directive sentences those that specify the details of an obligation created in a different sentence, e.g., by elaborating on the requirements of a work product obligation:

(8) Reports must enumerate performance goals.

We treat these sentences as non-directives because they provide details of obligatory actions but do not in themselves create an obligation for an agency or other government entity. We defer handling of these sentences to future applications.

In summary, we found that directive summaries extracted from policy documents by human experts typically have deontic force, which may be absolute, qualified, or indirect, depending on the construction of the sentence. We hypothesize that summaries consisting of these deontic sentences closely match existing work products by agency personnel who currently monitor such documents, and that summaries of this type could benefit agencies by enabling agency personnel to quickly identify the impact of new obligations, improving an agency's capability for complete and timely compliance.

3 RELATED WORK

Providing assistance to agencies in complying with complex regulatory and policy constraints is increasingly recognized as an important AI application. Typical examples include development of knowledge acquisition techniques to increase agility in public administration [4] and information retrieval techniques optimized for regulatory texts [6]. Research in this area has addressed both cross-document relationships among regulatory and statutory texts, such as network structure [14], and within-document analysis, such as discourse analysis of regulatory paragraphs [5] and parsing statutory and regulatory rule texts into a computer-interpretable form [24]. The work most closely related to the objectives of the current work is [17], which addressed sentential modality classification of sentences in financial regulation texts.

A number of previous research projects have addressed the general task of modal sense disambiguation in legal and government texts. Marasović and Frank [15] developed a classifier for epistemic, deontic, and dynamic modal categories in English and German using a one-layer convolutional neural network (CNN) with feature maps and semantic feature detectors, reporting better results than with MaxEnt or a one-layer neural network. O'Neill et al. [17] combined a neural network with both legal-specific and more general distributional semantic model representations to distinguish among the deontic modalities obligation, prohibition, and permission. Wyner and Peters [19] used a rule-based approach to extract conditional and deontic rules from the U.S. Federal Code of Regulations. They found that this approach worked well for a specific set of regulatory texts, but its generality is unclear. De Maat et al. [7] compared machine learning approaches to knowledge-based approaches for legal text classification in Dutch legislation, finding that while machine learning classifiers performed as well as the pattern-based model, the pattern-based approach generalized better than the machine learning model to new texts.

The modality classification task addressed by ADEPT differs from this prior work in that it focuses on the deontic distinctions relevant specifically to the task of extracting and summarizing the directives in administrative and policy documents, e.g., distinguishing deontic from non-deontic sentences and distinguishing among the categories of deontic sentences relevant to a particular application (e.g., absolute and qualified obligations). As discussed below, ADEPT additionally addresses tasks both upstream from deontic sentence detection, such as linearization of nested directive sentences, and downstream, such as instantiation of obligation frames and conversion of instantiated frames into a structured form useful to agency personnel.

4 HANDLING NESTED DIRECTIVES

Authoritative administrative texts, including directives, regulations, and statutes, are often expressed in the form of nested enumerations, such as the directive set forth in Figure 1. Nested structures are characterized by multiple dependent clauses or sentential complements to common superordinate clauses. Such structures are intended to express complex rules and directives in a compact and comprehensible style by reducing textual redundancy.

Figure 1: A typical nested directive sentence. By itself, punctuation is insufficient to disambiguate whether the phrase in the box is a child of "Enhance email security . . ." or "Within 30 calendar days . . .". Either indentations or enumeration/itemization marks are required to resolve this ambiguity.
Human readers can easily understand the logical structure of such sentences because the relationships among clauses are signaled by hierarchical relations among varying levels of enumeration symbols, punctuation marks, and varying indentation depths.

Unfortunately, parsers trained on standard treebanks, which are generally based on articles from news sources such as the Wall Street Journal, are often unable to process sentences with nested enumerations [16]. Thus, until domain-specific treebanks that include nested sentences have been developed for legal texts, it will remain necessary to convert such sentences into logically equivalent representations that are more amenable to conventional parsers.

One approach to simplifying the syntactic structure of nested enumerations is to convert them into a series of unnested sentences "by starting from the root of the tree and by concatenating, for each possible path, the phrases found until the leaves are reached" [9]. Each depth-first traversal of this tree yields a simple (non-compound) sentence. We refer to this process as linearization. For example, the first sentence in a linearization of the nested sentence shown in Figure 1 is:

(9) All agencies are required to, within 30 calendar days after issuance of this directive, develop and provide to DHS an agency Plan of Action for BOD 18-01 to enhance email security by, within 90 days after issuance of this directive, configuring all internet-facing mail servers to offer STARTTLS.

Linearization of regulatory and statutory text can be complicated by ambiguity in the scope of logical connectives that can arise from inconsistencies in expressing conjunction and disjunction in legal texts [1]. Nested directives, on the other hand, appear generally to be implicitly conjunctive, so linearization into a set of separate individual directives, each corresponding to a path in the depth-first traversal of the tree representing the logical form of the sentence, is generally consistent with the intended semantics of the original nested form.

As a practical matter, the greatest challenge in documents published in PDF (the primary format used by the agencies that we support) is determining the nesting level of each constituent clause with respect to surrounding clauses. Text extracted using standard tools, such as Apache Tika [2] and Tesseract [23], does not reliably retain the indentation depths of the original PDF. Punctuation marks often signal the nesting level, e.g., a clause that ends with a colon is to be followed by one or more subordinate (more deeply nested) clauses, and a period usually indicates a leaf node. However, there is an inherent ambiguity in sentences that follow a leaf node, such as the sentence in the box in Figure 1: "Within 120 days after issuance of this directive, ensuring:". Without either an unambiguous indication of indentation depth relative to surrounding clauses or an enumeration mark signaling a clear relationship to other lines of enumerated text, it is impossible to determine whether this sentence is (1) at the level of the sentence that starts "Within 90 days", (2) at the level of the sentence that starts "Enhance email security by:", or (3) the start of a new nested expression.

The lack of accurate indentation depths in text extracted from PDF documents and the ambiguity of the typical punctuation conventions suggest that the enumeration and bullet symbols, together with punctuation, must be the source of nesting information. After all, these are generally unambiguous for human readers. Unfortunately, there is no canonical hierarchical practice of enumerations and bullets; document conventions vary not just among agencies but often within the same issuing agency from one document to the next. Enumeration and bulleting formats are sometimes applied inconsistently even within the same document. Our strategy is therefore to make an initial traversal of each document, recording the order of occurrence of each of a standard set of possible enumeration styles and conventions, to establish a given document's hierarchical structure in each section. Each nested expression is then replaced with its linearized equivalent as determined from the hierarchy established in the initial pass. The Appendix sets forth this procedure in more detail.

Our approach differs from that of Dragoni et al. [9], which mapped enumerated propositions onto a legal ontology to define the domain of directives and their constituent subparts, in using a concept-agnostic approach that may be better suited for domains in which directives are frequently revised, rescinded, or recontextualized in ways that may not be amenable to previous ontologies.

The extraction tools described below are intended to remove reference footnotes, HTML links, page numbers, and other extraneous information from within the span of single extracted sentences, but remaining bits of extraneous text create challenges for NLP processes downstream in our pipeline, such as POS and dependency parsing, event extraction, and modality detection. The last step of the linearization component therefore attempts to push these remaining items to the bottom of the linearized document as standardized endnotes.
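To make the path-concatenation step concrete, the following minimal Python sketch linearizes a nested directive by depth-first traversal, assuming the nesting structure has already been recovered; the class and function names are illustrative rather than part of ADEPT's implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Clause:
    """A clause in a nested enumeration, e.g., one bulleted or enumerated line."""
    text: str
    children: List["Clause"] = field(default_factory=list)

def linearize(clause: Clause, prefix: str = "") -> List[str]:
    """Concatenate the text along each root-to-leaf path into one flat sentence."""
    combined = (prefix + " " + clause.text).strip()
    if not clause.children:          # leaf: one complete, unnested directive
        return [combined]
    sentences = []
    for child in clause.children:    # depth-first traversal of the enumeration tree
        sentences.extend(linearize(child, combined))
    return sentences

# A simplified fragment of the nested BOD 18-01 directive shown in Figure 1.
root = Clause(
    "All agencies are required to",
    [Clause("within 30 calendar days after issuance of this directive, "
            "develop and provide to DHS an agency Plan of Action for BOD 18-01 to",
            [Clause("enhance email security by",
                    [Clause("within 90 days after issuance of this directive, "
                            "configuring all internet-facing mail servers to offer STARTTLS.")])])])

for sentence in linearize(root):
    print(sentence)
```

Each printed sentence corresponds to one root-to-leaf path, as in example (9) above.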
5 DIRECTIVE SENTENCE CLASSIFICATION

Our working hypothesis is that policy-document summaries consisting of some or all categories of directive sentences described above can be a proxy for, assist in the creation of, or supplement manually created compliance summaries. Thus, we focus on classifying sentences with respect to these directive sentence categories.

5.1 Directive-Sentence Corpus

Unfortunately, none of the models or corpora developed in the prior work on sentence modality classification described above are directly applicable to our task. We therefore found it necessary to develop a new annotated directive sentence corpus based on U.S. executive-branch policy directives. Our initial focus was on OMB Memoranda and DHS Binding Operational Directives, for which we had examples of agency work products. We downloaded 5 years of OMB directives from the White House website (https://www.whitehouse.gov/omb/information-for-agencies/memoranda/).

Each of the documents in the corpus was originally published in PDF format, usually with the first page scanned and signed. Each document was converted to plain text using the Apache Tika software package [2]. In parallel, each document was processed with Grobid [11] to identify elements such as headers and footers that can interrupt text that spans from one page to the next. The elements identified using Grobid were disinterleaved from the main text and concatenated at the end of each document (footnote texts must be retained because they sometimes contain directives).

As described in Section 4, policy documents often contain complex sentences, including bullet-pointed lists and enumerations, that establish multiple distinct obligations. Accordingly, each nested sentence in the corpus was converted into a set of simple sentences using the linearization process described in Section 4. Each of the resulting sentences was then annotated according to the categories set forth in Section 2 by several annotators, including a subject-matter expert and several linguists.

The resulting set of 2,582 labeled sentences served as ground truth in the construction of the machine learning-based models described below. The mean length of these sentences was 38 tokens. Table 1 shows the proportion of sentences of each of the 3 directive types that have a modal auxiliary (the modal verbs considered were can, could, may, might, must, shall, should, will, and would). These ratios illustrate that the presence of modal auxiliaries is neither necessary nor sufficient for directives in this domain. This annotated corpus will be made available to researchers in 2019 at http://mat-annotation.sourceforge.net/.

Table 1: The proportion of sentences of each of the 3 directive types and of non-directive sentences having a modal auxiliary.

    Type            Ratio        Percent
    Absolute        461/592      77.8%
    Qualified       378/461      82.0%
    Indirect        42/103       40.8%
    Non-directive   346/1426     24.3%
    Total           1227/2582    47.5%

5.2 Evaluation of Deontic Sentence Classification

We converted each sentence of our corpus into a vector of semantic role values using AllenNLP [10]. These vectors were converted to ARFF format (https://www.cs.waikato.ac.nz/ml/weka/arff.html) and evaluated in 10-fold cross-validation using the Weka [12] implementation of a support vector machine (SVM) trained with Platt's algorithm for sequential minimal optimization [13, 21]. As shown in Table 2, a mean F-score of 0.812 was achieved across all four categories. A mean F-score of 0.846 (with ROC Area of 0.689) was obtained for the binary task of distinguishing non-directives from any of the 3 types of directive sentences.

Table 2: Four-category deontic sentence prediction accuracy.

    Class           P       R       F1      ROC Area
    Absolute        0.784   0.809   0.796   0.934
    Qualified       0.795   0.720   0.755   0.854
    Indirect        0.574   0.301   0.395   0.771
    Non-Directive   0.845   0.898   0.871   0.855
    Weighted Avg.   0.812   0.818   0.812   0.866

This experiment indicates that the deontic categories of relevance to our task can be distinguished by a model trained on a corpus of modest size. We anticipate that this accuracy can be improved by expanding the annotated data set and refining the text extraction and linearization processes that provide input into the classifier.
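For concreteness, the following sketch reproduces the shape of this evaluation with scikit-learn in place of Weka's SMO implementation; the toy sentences, the load_annotated_sentences helper, and the bag-of-n-grams features are illustrative stand-ins for the annotated corpus and the semantic-role feature vectors actually used.

```python
# Illustrative sketch of a four-category deontic sentence classifier evaluated
# by cross-validation; scikit-learn stands in for Weka, and the features and
# data below are simplified placeholders, not ADEPT's actual configuration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def load_annotated_sentences():
    """Placeholder for loading (sentence, label) pairs from the annotated corpus."""
    sentences = [
        "Agencies must establish performance goals.",                                   # absolute
        "Each agency is required to disable 3DES and RC4 ciphers on mail servers.",     # absolute
        "Agencies should establish and report other meaningful performance indicators.",# qualified
        "Senior executives may consider delaying new financial assistance obligations.",# qualified
        "Section 1 of the Executive Order requires agency heads to ensure risk management.",  # indirect
        "The memorandum directs agencies to submit quarterly progress reports.",        # indirect
        "Reports must enumerate performance goals.",                                    # non-directive (detail only)
        "This policy builds on prior efforts to modernize federal cybersecurity.",      # non-directive
    ]
    labels = ["absolute", "absolute", "qualified", "qualified",
              "indirect", "indirect", "non-directive", "non-directive"]
    return sentences, labels

sentences, labels = load_annotated_sentences()

# Linear SVM with Platt scaling (probability=True) over simple lexical features.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    SVC(kernel="linear", probability=True),
)

# The reported evaluation used 10-fold cross-validation; 2 folds here only
# because the toy data set is tiny.
scores = cross_val_score(model, sentences, labels, cv=2, scoring="f1_weighted")
print("mean weighted F1:", scores.mean())
```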
6 SEMANTIC ROLE LABELING AND TEMPLATE INSTANTIATION

For many agency applications, the most useful representation of directives is in the form of structured tables or spreadsheets summarizing multiple sentences. Analysis of representative work products indicated that the information of interest from each sentence includes the following:

• Actor - the agency or office to which the obligation applies
• Activity - the activity that is required of the Actor
• Object - the work product to be produced by the Activity, if any
• Time - any time-related qualification of the directed activity
• Manner - any non-time-related qualification of the directed activity
• Modal - whether the activity is obligatory, permitted, or suggested, as indicated by the particular modal or other verb used to convey the deontic character of the expression, e.g., "must" vs. "may"

For each directive, we instantiate a frame containing argument slots for each of the types of information above. For example, the instantiated frame shown in Table 3 summarizes the key information from the following directive sentence:

(10) Within 60 days of this Memorandum's publication agencies must update their list of non-governmental URLs.

Table 3: An instantiated directive template.

    Actor      agencies
    Activity   update
    Object     list of non-governmental URLs
    Time       within 60 days
    Modal      must

The slots in the directive frame are a domain-specific adaptation of standard semantic roles. We use the semantic role labeling model of AllenNLP [10] to assign PropBank semantic role labels [18] to directive sentences. We then use a set of simple heuristic rules to map these labels to the slots of our frames: e.g., a PropBank "ARG0" is generally the Actor, "ARG1" is generally the Object, and "Temporal" corresponds to the Time slot. Directives expressed without a modal verb ("All agencies are required to ...") have no entry in the "Modal" field.
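The following sketch illustrates how heuristic rules of this kind can map per-token PropBank-style SRL output (the BIO tag format produced by SRL predictors such as AllenNLP's) onto the frame slots above; the specific rules, helper names, and example tags are illustrative, not ADEPT's exact rule set.

```python
# Minimal sketch of mapping PropBank-style SRL tags onto directive-frame slots.
# The role-to-slot table and the hard-coded example are assumptions for
# illustration only.
from collections import defaultdict
from typing import Dict, List

MODAL_VERBS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}

# Heuristic correspondence between PropBank argument labels and frame slots.
ROLE_TO_SLOT = {
    "ARG0": "Actor",
    "ARG1": "Object",
    "ARGM-TMP": "Time",
    "ARGM-MNR": "Manner",
}

def instantiate_frame(tokens: List[str], tags: List[str], verb: str) -> Dict[str, str]:
    """Group tokens by their BIO role label and map roles onto frame slots."""
    spans = defaultdict(list)
    for token, tag in zip(tokens, tags):
        if tag != "O":
            spans[tag.split("-", 1)[1]].append(token)   # strip the B-/I- prefix

    frame = {"Activity": verb}
    for role, slot in ROLE_TO_SLOT.items():
        if role in spans:
            frame[slot] = " ".join(spans[role])
    modal = next((t for t in tokens if t.lower() in MODAL_VERBS), None)
    if modal:                                           # no Modal entry without a modal verb
        frame["Modal"] = modal
    return frame

# SRL output (simplified) for sentence (10).
tokens = ["Within", "60", "days", "agencies", "must", "update",
          "their", "list", "of", "non-governmental", "URLs", "."]
tags = ["B-ARGM-TMP", "I-ARGM-TMP", "I-ARGM-TMP", "B-ARG0", "B-ARGM-MOD", "B-V",
        "B-ARG1", "I-ARG1", "I-ARG1", "I-ARG1", "I-ARG1", "O"]
print(instantiate_frame(tokens, tags, verb="update"))
```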
7 SYSTEM ARCHITECTURE

As illustrated in Figure 2, ADEPT's directive extraction and analysis tasks require a series of processing steps. We have adopted a modular architecture that can accommodate a variety of alternative components.

Figure 2: The directive sentence processing pipeline.

The first stage of the pipeline consists of concurrent calls to the APIs of the Tika and Grobid services offered by their respective Docker [8] containers. Tika outputs the PDF extraction as plain text, whereas Grobid outputs the footnotes embedded in XML. The merge stage integrates this content and outputs a text file consisting of disinterleaved page content followed by all footnotes. The linearizer takes this text as input and outputs a text file containing one linearized sentence per line. The linguistic feature extractor converts each sentence into a feature vector of n-grams and features derived from a dependency parse.

An API call to the Docker container of the AllenNLP service is then made with a JSON file containing all sentences identified as being of the target deontic type or types (e.g., absolute). The AllenNLP output is passed to the template instantiation stage. The final output consists of CSV and HTML files that can be loaded into a spreadsheet or viewed through a web browser.
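A rough sketch of how these stages could be composed is shown below; the service endpoints, the placeholder stage functions, and the keyword-based stand-in for the trained classifier are assumptions for illustration, and the Grobid and merge stages are omitted for brevity.

```python
# Illustrative composition of the pipeline stages in Figure 2. The URLs and the
# linearize/classify functions are placeholders, not ADEPT's implementation.
import csv
import requests

TIKA_URL = "http://localhost:9998/tika"              # assumed local Tika server
SRL_URL = "http://localhost:8000/predict/srl"        # assumed local SRL service

def extract_text(pdf_path: str) -> str:
    """Send the PDF to a Tika server and return extracted plain text."""
    with open(pdf_path, "rb") as f:
        resp = requests.put(TIKA_URL, data=f, headers={"Accept": "text/plain"})
    resp.raise_for_status()
    return resp.text

def linearize(text: str) -> list[str]:
    """Placeholder: yield one flattened sentence per line (see Section 4)."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def classify_deontic(sentences: list[str]) -> list[str]:
    """Placeholder: crude keyword filter standing in for the trained SVM classifier."""
    return [s for s in sentences if any(m in s.lower() for m in ("must", "shall", "required"))]

def label_roles(sentence: str) -> dict:
    """Send one directive sentence to a semantic role labeling service."""
    resp = requests.post(SRL_URL, json={"sentence": sentence})
    resp.raise_for_status()
    return resp.json()

def run_pipeline(pdf_path: str, out_csv: str) -> None:
    sentences = classify_deontic(linearize(extract_text(pdf_path)))
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sentence", "srl"])
        for s in sentences:
            writer.writerow([s, label_roles(s)])
```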
8 DISCUSSION AND FUTURE WORK

ADEPT illustrates how directive extraction, a document analysis task that imposes a significant burden on a wide range of agencies, can be addressed by deontic sentence classification in combination with nested sentence disambiguation and semantic role labeling. We anticipate that an ADEPT directive-extraction pilot will take place in mid-2019 with a representative U.S. federal agency.

Future work will relax ADEPT's current simplifying assumption that the directive content of policy documents can be determined by analyzing individual sentences divorced from their surrounding context. For within-document contextual information, we plan to introduce entity resolution and links connecting sentences that elaborate on an obligation with the obligation sentence to which they apply. To improve cross-document contextual information, we plan to develop techniques to detect and classify references to other documents, particularly statements that the current document rescinds directives from other policy documents.

Automated analysis of policy documents presents a rich set of text-analytic tasks but promises very significant rewards to both agencies and citizens. ADEPT represents an initial realization of this approach to improving the administrative state through modern computational linguistics techniques.

ACKNOWLEDGMENTS

The MITRE Corporation is a not-for-profit company, chartered in the public interest, that operates multiple federally funded research and development centers. This document is approved for Public Release; Distribution Unlimited. Case Number 18-4602.
REFERENCES

[1] L. Allen and C. Saxon. More IA needed in AI: Interpretation assistance for coping with the problem of multiple structural interpretations. In Proceedings of the Third International Conference on Artificial Intelligence and Law, pages 53–61, Oxford, England, June 25–28, 1991.
[2] Apache Tika - a content analysis toolkit. https://tika.apache.org/. Accessed: 2018-11-16.
[3] J. Austin. How to Do Things with Words. Oxford University Press, New York, 1962.
[4] A. Boer and T. van Engers. An agent-based legal knowledge acquisition methodology for agile public administration. In Proceedings of the 13th International Conference on Artificial Intelligence and Law, ICAIL '11, pages 171–180, New York, NY, USA, 2011. ACM.
[5] A. Buabuchachart, K. Metcalf, N. Charness, and L. Morgenstern. Classification of regulatory paragraphs by discourse structure, reference structure, and regulation type. In Proceedings of the 26th International Conference on Legal Knowledge-Based Systems (JURIX), University of Bologna, Bologna, Italy, November 2013.
[6] D. Collarana, T. Heuss, J. Lehmann, I. Lytra, G. Maheshwari, R. Nedelchev, T. Schmidt, and P. Trivedi. A question answering system on regulatory documents. In Proceedings of the 31st International Conference on Legal Knowledge and Information Systems (JURIX), 2018.
[7] E. de Maat, K. Krabben, and R. Winkels. Machine learning versus knowledge based classification of legal texts. In Proceedings of the 2010 Conference on Legal Knowledge and Information Systems: JURIX 2010: The Twenty-Third Annual Conference, pages 87–96, Amsterdam, The Netherlands, 2010. IOS Press.
[8] Docker. https://www.docker.com/. Accessed: 2019-01-24.
[9] M. Dragoni, S. Villata, W. Rizzi, and G. Governatori. Combining NLP approaches for rule extraction from legal documents. In 1st Workshop on MIning and REasoning with Legal texts (MIREL 2016), Sophia Antipolis, France, December 2016.
[10] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. E. Peters, M. Schmitz, and L. Zettlemoyer. AllenNLP: A deep semantic natural language processing platform. CoRR, abs/1803.07640, 2018.
[11] GROBID: GeneRation Of BIbliographic Data. https://grobid.readthedocs.io/en/latest/. Accessed: 2018-12-18.
[12] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 2009.
[13] S. Keerthi, S. Shevade, C. Bhattacharyya, and K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3):637–649, 2001.
[14] M. Koniaris, I. Anagnostopoulos, and Y. Vassiliou. Network analysis in the legal domain: a complex model for European Union legal sources. Journal of Complex Networks, 6(2):243–268, 2018.
[15] A. Marasović and A. Frank. Multilingual modal sense classification using a convolutional neural network. In P. Blunsom, K. Cho, S. B. Cohen, E. Grefenstette, K. M. Hermann, L. Rimell, J. Weston, and S. W. Yih, editors, Proceedings of the 1st Workshop on Representation Learning for NLP, Rep4NLP@ACL 2016, Berlin, Germany, August 11, 2016, pages 111–120. Association for Computational Linguistics, 2016.
[16] L. Morgenstern. Toward automated international law compliance monitoring (TAILCM). Technical report, Leidos, Inc., 2014. AFRL-RI-RS-TR-2014-206.
[17] J. O'Neill, P. Buitelaar, C. Robin, and L. O'Brien. Classifying sentential modality in legal language: a use case in financial regulations, acts and directives. In Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law, ICAIL 2017, London, United Kingdom, June 12–16, 2017, pages 159–168, 2017.
[18] M. Palmer, D. Gildea, and P. Kingsbury. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106, March 2005.
[19] W. Peters and A. Z. Wyner. Legal text interpretation: Identifying Hohfeldian relations from text. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May 23–28, 2016. European Language Resources Association (ELRA), 2016.
[20] The Plain Writing Act of 2010, 2010. 111th Congress, H.R. 946.
[21] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods, pages 185–208. MIT Press, Cambridge, MA, USA, 1999.
[22] J. Searle. Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press, Cambridge, 1969.
[23] Tesseract OCR. https://opensource.google.com/projects/tesseract. Accessed: 2018-11-16.
[24] A. Wyner and W. Peters. On rule extraction from regulations. Frontiers in Artificial Intelligence and Applications, (235), January 2011.

9 APPENDIX: LINEARIZATION OF NESTED DIRECTIVES

FOR each document ingested by the linearizer:
    Preprocess: Remove footnotes to prevent splitting of enumerated list elements or main body sentences during downstream processing later in the classification pipeline
        EXTRACT strings matching footnote format
        STORE matching strings in References array
        DELETE matching strings in their original positions
        DELETE all multiple (n-1) vertical and horizontal spacing
    Detect Document Section Boundaries: Identify the position of each document section to prevent enumerated elements from spanning multiple distinct lists
        MATCH list of known section headers
        STORE matches in partition, along with starting offset position for each section, in index
        READ any enumerated lists in between section boundaries
    Parse and Concatenate Enumerations: Map document hierarchical enumeration conventions against different symbol sets; concatenate all directly subordinated sentence fragments with their subordinating fragments to form full (flat) sentences from the enumerated elements for downstream processing later in the classification pipeline
        MATCH lines in each enumerated list within each section against enumeration symbol style list delimited by punctuation cues (Uppercase Roman Numerals, Lowercase Roman Numerals, Uppercase Letters, Lowercase Letters, Number Digits, Solid Bullet Points, Hollow Bullet Points)
        STORE the sequential order (i.e., layers) of enumeration styles encountered to set document convention, where each layer begins with its own closed set of enumeration symbols
        FOR lower-order layers
            CONCATENATE lines recursively with all parent layers
            TERMINATE upon reaching a new paragraph with no enumeration symbol at the start of the line
        ITERATE over all sections
        WRITE to [FILENAME]_paths.txt file
    Standardize Global Enumeration: Rewrite enumeration conventions to a standard format (e.g., I.iii.B.a. → 1.3.2.1.)
        FOR all enumerated lists
            REWRITE each line's enumeration symbol with its corresponding digit based on the layer order and within-layer order
        WRITE to [FILENAME]_trees.txt file
    Post-Process Footnotes: Add previously extracted footnotes to the bottom of the document
        APPEND footnote elements to bottom of the [FILENAME]_paths.txt file under the new section header "Footnotes"
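As an illustration of the Standardize Global Enumeration step, the following sketch detects each line's enumeration style, records the order in which styles first appear as the document's layer convention, and rewrites each symbol as a dotted numeric path (e.g., I.iii.B.a. → 1.3.2.1.); the regexes, style inventory, and sample text are assumptions rather than ADEPT's implementation, and single letters such as "I" remain ambiguous between Roman-numeral and letter styles without document-level context.

```python
# Illustrative enumeration-style detection and standardization.
import re

STYLES = [  # (style name, regex for the leading enumeration token)
    ("upper_roman", re.compile(r"^\(?([IVXLC]+)[.)]\s+")),
    ("lower_roman", re.compile(r"^\(?([ivxlc]+)[.)]\s+")),
    ("upper_letter", re.compile(r"^\(?([A-Z])[.)]\s+")),
    ("lower_letter", re.compile(r"^\(?([a-z])[.)]\s+")),
    ("digit", re.compile(r"^\(?(\d+)[.)]\s+")),
    ("bullet", re.compile(r"^([•○o\-\*])\s+")),
]

def detect_style(line):
    for name, pattern in STYLES:
        match = pattern.match(line.strip())
        if match:
            return name, match
    return None, None

def standardize(lines):
    layer_order = []   # styles in order of first appearance = assumed nesting depth
    counters = {}      # per-style ordinal counter within the current list
    out = []
    for line in lines:
        style, match = detect_style(line)
        if style is None:
            out.append(line)
            counters.clear()            # a plain paragraph terminates the list
            continue
        if style not in layer_order:
            layer_order.append(style)
        depth = layer_order.index(style)
        counters[style] = counters.get(style, 0) + 1
        for deeper in layer_order[depth + 1:]:
            counters.pop(deeper, None)  # reset counters of more deeply nested layers
        prefix = ".".join(str(counters.get(s, 1)) for s in layer_order[:depth + 1])
        out.append(prefix + ". " + line.strip()[match.end():])
    return out

sample = ["I. Enhance email security by:",
          "   a. Within 90 days, configuring all mail servers to offer STARTTLS;",
          "   b. Within 120 days, ensuring:",
          "      i. Weak cipher suites are disabled on mail servers.",
          "II. Enhance web security by:"]
print("\n".join(standardize(sample)))
```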