Automated Directive Extraction from Policy Texts

Karl Branting, Jim Finegan, David Shin, Stacy Petersen
The MITRE Corporation, McLean, VA, USA
{lbranting, jfinegan, hshin, spetersen}@mitre.org

Carlos Balhana
Language Technology Lab, University of Cambridge, Cambridge, UK
ceb81@cam.ac.uk

Alex Lyte
The MITRE Corporation, Bedford, MA, USA
alyte@mitre.org

Craig Pfeifer
The MITRE Corporation, Ann Arbor, MI, USA
cpfeifer@mitre.org

In: Proceedings of the Workshop on Artificial Intelligence and the Administrative State (AIAS 2019), June 17, 2019, Montreal, QC, Canada. © 2019 for this paper by The MITRE Corporation. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published at http://ceur-ws.org

ABSTRACT

Federal agencies must comply with directives expressed in documents issued by authoritative sources elsewhere in the government. To automate identification of these directives, the ADEPT (Automated Directive Extraction from Policy Texts) system exploits the observation that directive sentences are usually characterized by deontic modality (e.g., "must", "shall", etc.), permitting the open-ended task of summarizing obligations to be reduced to a well-defined and circumscribed linguistic analysis task. ADEPT comprises a linearizer, which converts deeply nested sentences into a form that can be handled by standard parsers; a deontic sentence classifier trained on an annotated corpus of sentences drawn from representative policy documents; a semantic role analyzer; and other analytic tools for extracting and analyzing the deontic content of policy documents.

1 INTRODUCTION

Modern administrative states are regulated by statutes, regulations, and other authoritative legal sources that are expressed in complex, interconnected texts. Compliance with these rules is challenging for agencies, citizens, rule-drafters, and attorneys alike. For agencies, compliance requires understanding changes in federal laws, executive orders, and authoritative directives, policies, regulations, and standards. Simply identifying and summarizing these changes, which often originate from a multitude of sources, can be a burdensome drain on staff resources.

The diversity of authoritative sources imposing requirements of a given nature is typified by the proliferation of cybersecurity requirements on U.S. federal agencies. Directives can be expressed in Executive Orders, Office of Management and Budget (OMB) circulars and memoranda, Department of Homeland Security (DHS) Binding Operational Directives (BODs), National Institute of Standards and Technology (NIST) Federal Information Processing Standards (FIPS), and Special Publications (SPs). Each agency must devote staff to monitor and review multiple streams of publications to identify changes affecting their cybersecurity profile (i.e., policies, practices, procedures, standards, and/or guidance).

A similar monitoring task is required for all other areas within an agency where compliance is compulsory, such as privacy, health policy, and processing of sensitive information. An algorithmic process that automated the identification of sentences expressing obligations incumbent upon a given agency could significantly reduce the burden on staff having to review a large stream of documents. Such automated processes could provide agencies with early warnings of pending obligations, enabling them to better plan for implementation once the obligation is finalized.

A key observation of human performance on the document-monitoring task is that the summaries produced by staff typically focus on sentences that express obligations, i.e., that are characterized by deontic modality. This suggests that the tasks of monitoring and extracting directive sentences depend critically on the identification of such deontic sentences. We hypothesize that exploiting this observation will permit an important portion of the open-ended task of summarizing obligations to be reduced to a well-defined and circumscribed linguistic analysis task.

The remainder of this paper describes the design of a system for automated extraction of directives, ADEPT, and the evaluation of the critical deontic-sentence classification component. Section 2 presents examples of directives and describes the characteristics that distinguish directives from non-directives and different types of directives from one another. Section 3 discusses prior related work on modality classification, and the handling of nested directives, that is, sentences in which dependent clauses or sentential complements share a common root clause, is discussed in Section 4. Section 5 sets forth ADEPT's approach to identifying and classifying directive sentences, and Section 6 describes the use of semantic role labeling and frame instantiation to extract structured knowledge from sentences identified as directives. The implemented ADEPT architecture is described in Section 7, and Section 8 summarizes and outlines future efforts.

2 DIRECTIVE SENTENCES IN POLICY DOCUMENTS

ADEPT is based on an analysis of the work products of subject-matter experts engaged in monitoring federal policy documents originating from authoritative sources such as those listed in Section 1. Analysis of these work products revealed that directives typically consist of expressions of obligations on the part of an agency or other government entity to perform or refrain from some specified actions, such as:

(1) Agencies must establish performance goals.
(2) Agencies are required to provide narrative responses regarding their risk management decision process.
(3) Each agency business owner is directed to ensure that 3DES and RC4 ciphers are disabled on mail servers.
(4) Chief Information Officers are to submit a report within 180 days.
These directive sentences can be viewed as illocutionary [3] or performative texts [22] that make a given action compulsory for a given government entity (i.e., the agency or a holder of a role within the agency). Frequently, as in sentence 1 above, directive sentences use modal verbs, such as "must", "shall", "may", and "should", as auxiliaries [20]. However, sentences 2–4 illustrate that obligations can be expressed without the use of modal verbs.

In addition to these absolute, i.e., unqualified, sentences, there are two other types of sentences that are important for some, but not all, applications.

First, some directives are qualified in the sense of expressing either permission or weak necessity, as in the following two sentences:

(5) Senior executives may consider delaying awarding new financial assistance obligations (permission).
(6) Agencies should establish and report other meaningful performance indicators and goals (weak necessity).

Second, some sentences merely report an obligation created by a different document, rather than creating an obligation themselves, such as:

(7) Section 1 of the Executive Order requires agency heads to ensure appropriate risk management.

We term such sentences indirect obligation sentences.

We exclude from our set of directive sentences those that specify the details of an obligation created in a different sentence, e.g., by elaborating on the requirements of a work product obligation:

(8) Reports must enumerate performance goals.

We treat these sentences as non-directives because they provide details of obligatory actions but do not in themselves create an obligation for an agency or other government entity. We defer handling of these sentences to future applications.

In summary, we found that directive summaries extracted from policy documents by human experts typically have deontic force, which may be absolute, qualified, or indirect, depending on the construction of the sentence. We hypothesize that summaries consisting of these deontic sentences closely match existing work products by agency personnel who currently monitor such documents, and that summaries of this type could benefit agencies by enabling agency personnel to quickly identify the impact of new obligations, improving an agency's capability for complete and timely compliance.

3 RELATED WORK

Providing assistance to agencies in complying with complex regulatory and policy constraints is increasingly recognized as an important AI application. Typical examples include development of knowledge acquisition techniques to increase agility in public administration [4] and information retrieval techniques optimized for regulatory texts [6]. Research in this area has addressed both cross-document relationships among regulatory and statutory texts, such as network structure [14], and within-document analysis, such as discourse analysis of regulatory paragraphs [5] and parsing statutory and regulatory rule texts into a computer-interpretable form [24]. The work most closely related to the objectives of the current work is [17], which addressed sentential modality classification of sentences in financial regulation texts.

A number of previous research projects have addressed the general task of modal sense disambiguation in legal and government texts. Marasović and Frank [15] developed a classifier for epistemic, deontic, and dynamic modal categories in English and German using a one-layer convolutional neural network (CNN) with feature maps and semantic feature detectors, reporting better results than with MaxEnt or a one-layer neural network. O'Neill et al. [17] combined a neural network with both legal-specific and more general distributional semantic model representations to distinguish among the deontic modalities obligation, prohibition, and permission. Wyner and Peters [19] used a rule-based approach to extract conditional and deontic rules from the U.S. Federal Code of Regulations. They found that this approach worked well for a specific set of regulatory texts, but its generality is unclear. De Maat et al. [7] compared machine learning approaches to knowledge-based approaches for legal text classification in Dutch legislation, finding that while machine learning classifiers performed as well as the pattern-based model, the pattern-based approach generalized better than the machine learning model to new texts.

The modality classification task addressed by ADEPT differs from this prior work in that it focuses on the deontic distinctions relevant specifically to the task of extracting and summarizing the directives in administrative and policy documents, e.g., distinguishing deontic from non-deontic sentences and distinguishing among the categories of deontic sentences relevant to a particular application (e.g., absolute and qualified obligations). As discussed below, ADEPT additionally addresses tasks both upstream from deontic sentence detection, such as linearization of nested directive sentences, and downstream, such as instantiation of obligation frames and conversion of instantiated frames into a structured form useful to agency personnel.

4 HANDLING NESTED DIRECTIVES

Authoritative administrative texts, including directives, regulations, and statutes, are often expressed in the form of nested enumerations, such as the directive set forth in Figure 1. Nested structures are characterized by multiple dependent clauses or sentential complements to common superordinate clauses. Such structures are intended to express complex rules and directives in a compact and comprehensible style by reducing textual redundancy.

Figure 1: A typical nested directive sentence. By itself, punctuation is insufficient to disambiguate whether the phrase in the box is a child of "Enhance email security . . ." or "Within 30 calendar days . . .". Either indentations or enumeration/itemization marks are required to resolve this ambiguity.
Human readers can easily understand the logical structure of such sentences because the relationships among clauses are signaled by hierarchical relations among varying levels of enumeration symbols, punctuation marks, and varying indentation depths.

Unfortunately, parsers trained on standard treebanks, which are generally based on articles from news sources such as the Wall Street Journal, are often unable to process sentences with nested enumerations [16]. Thus, until domain-specific treebanks that include nested sentences have been developed for legal texts, it will remain necessary to convert such sentences into logically equivalent representations that are more amenable to conventional parsers.

One approach to simplifying the syntactic structure of nested enumerations is to convert them into a series of unnested sentences "by starting from the root of the tree and by concatenating, for each possible path, the phrases found until the leaves are reached" [9]. Each depth-first traversal of this tree yields a simple (non-compound) sentence. We refer to this process as linearization. For example, the first sentence in a linearization of the nested sentence shown in Figure 1 is:

(9) All agencies are required to, within 30 calendar days after issuance of this directive, develop and provide to DHS an agency Plan of Action for BOD 18-01 to enhance email security by, within 90 days after issuance of this directive, configuring all internet-facing mail servers to offer STARTTLS.

Linearization of regulatory and statutory text can be complicated by ambiguity in the scope of logical connectives that can arise from inconsistencies in expressing conjunction and disjunction in legal texts [1]. Nested directives, on the other hand, appear generally to be implicitly conjunctive, so linearization into a set of separate individual directives, each corresponding to a path in the depth-first traversal of the tree representing the logical form of the sentence, is generally consistent with the intended semantics of the original nested form.

As a practical matter, the greatest challenge in documents published in PDF (the primary format used by the agencies that we support) is determining the nesting level of each constituent clause with respect to surrounding clauses. Text extracted using standard tools, such as Apache Tika [2] and Tesseract [23], does not reliably retain the indentation depths of the original PDF. Punctuation marks often signal the nesting level, e.g., a clause that ends with a colon is to be followed by one or more subordinate (more deeply nested) clauses, and a period usually indicates a leaf node. However, there is an inherent ambiguity in sentences that follow a leaf node, such as the sentence in the box in Figure 1: "Within 120 days after issuance of this directive, ensuring:". Without either an unambiguous indication of indentation depth relative to surrounding clauses or an enumeration mark signaling a clear relationship to other lines of enumerated text, it is impossible to determine whether this sentence is (1) at the level of the sentence that starts "Within 90 days", (2) at the level of the sentence that starts "Enhance email security by:", or (3) the start of a new nested expression.

The lack of accurate indentation depths in text extracted from PDF documents and the ambiguity of the typical punctuation conventions suggest that the enumeration and bullet symbols, together with punctuation, must be the source of nesting information. After all, these are generally unambiguous for human readers. Unfortunately, there is no canonical hierarchical practice of enumerations and bullets; document conventions vary not just among agencies but often within the same issuing agency from one document to the next. Enumeration and bulleting formats are sometimes applied inconsistently even within the same document. Our strategy is therefore to make an initial traversal of each document, recording the order of occurrence of each of a standard set of possible enumeration styles and conventions, to establish a given document's hierarchical structure in each section. Each nested expression is then replaced with its linearized equivalent as determined from the hierarchy established in the initial pass. The Appendix sets forth this procedure in more detail.

Our approach differs from that of Dragoni et al. [9], which mapped enumerated propositions onto a legal ontology to define the domain of directives and their constituent subparts, in using a concept-agnostic approach that may be better suited for domains in which directives are frequently revised, rescinded, or recontextualized in ways that may not be amenable to previous ontologies.

The extraction tools described below are intended to remove reference footnotes, HTML links, page numbers, and other extraneous information from within the span of single extracted sentences, but remaining bits of extraneous text create challenges for NLP processes downstream in our pipeline, such as POS and dependency parsing, event extraction, and modality detection. The last step of the linearization component therefore attempts to push these remaining items to the bottom of the linearized document as standardized endnotes.
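To make the path-concatenation step concrete, the following minimal Python sketch linearizes a nested directive by depth-first traversal, assuming the nesting structure has already been recovered; the class and function names are illustrative rather than part of ADEPT's implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Clause:
    """A clause in a nested enumeration, e.g., one bulleted or enumerated line."""
    text: str
    children: List["Clause"] = field(default_factory=list)

def linearize(clause: Clause, prefix: str = "") -> List[str]:
    """Concatenate the text along each root-to-leaf path into one flat sentence."""
    combined = (prefix + " " + clause.text).strip()
    if not clause.children:          # leaf: one complete, unnested directive
        return [combined]
    sentences = []
    for child in clause.children:    # depth-first traversal of the enumeration tree
        sentences.extend(linearize(child, combined))
    return sentences

# A simplified fragment of the nested BOD 18-01 directive shown in Figure 1.
root = Clause(
    "All agencies are required to",
    [Clause("within 30 calendar days after issuance of this directive, "
            "develop and provide to DHS an agency Plan of Action for BOD 18-01 to",
            [Clause("enhance email security by",
                    [Clause("within 90 days after issuance of this directive, "
                            "configuring all internet-facing mail servers to offer STARTTLS.")])])])

for sentence in linearize(root):
    print(sentence)
```

Each printed sentence corresponds to one root-to-leaf path, as in example (9) above.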
5 DIRECTIVE SENTENCE CLASSIFICATION

Our working hypothesis is that policy-document summaries consisting of some or all categories of directive sentences described above can be a proxy for, assist in the creation of, or supplement manually created compliance summaries. Thus, we focus on classifying sentences with respect to these directive sentence categories.

5.1 Directive-Sentence Corpus

Unfortunately, none of the models or corpora developed in the prior work on sentence modality classification described above are directly applicable to our task. We therefore found it necessary to develop a new annotated directive sentence corpus based on U.S. executive-branch policy directives. Our initial focus was on OMB Memoranda and DHS Binding Operational Directives, for which we had examples of agency work products. We downloaded 5 years of OMB directives from the White House website (https://www.whitehouse.gov/omb/information-for-agencies/memoranda/).

Each of the documents in the corpus was originally published in PDF format, usually with the first page scanned and signed. Each document was converted to plain text using the Apache Tika software package [2]. In parallel, each document was processed with Grobid [11] to identify elements such as headers and footers that can interrupt text that spans from one page to the next. The elements identified using Grobid were disinterleaved from the main text and concatenated at the end of each document (footnote texts must be retained because they sometimes contain directives).

As described in Section 4, policy documents often contain complex sentences, including bullet-pointed lists and enumerations, that establish multiple distinct obligations. Accordingly, each nested sentence in the corpus was converted into a set of simple sentences using the linearization process described in Section 4. Each of the resulting sentences was then annotated according to the categories set forth in Section 2 by several annotators, including a subject-matter expert and several linguists.

The resulting set of 2,582 labeled sentences served as ground truth in the construction of the machine learning-based models described below. The mean length of these sentences was 38 tokens. Table 1 shows the proportion of sentences of each of the 3 directive types that have a modal auxiliary (the modal verbs considered were can, could, may, might, must, shall, should, will, and would). These ratios illustrate that the presence of modal auxiliaries is neither necessary nor sufficient for directives in this domain. This annotated corpus will be made available to researchers in 2019 at http://mat-annotation.sourceforge.net/.

Table 1: The proportion of sentences of each of the 3 directive types and of non-directive sentences having a modal auxiliary.

    Type            Ratio        Percent
    Absolute        461/592      77.8%
    Qualified       378/461      82.0%
    Indirect        42/103       40.8%
    Non-directive   346/1426     24.3%
    Total           1227/2582    47.5%

5.2 Evaluation of Deontic Sentence Classification

We converted each sentence of our corpus into a vector of semantic role values using AllenNLP [10]. These vectors were converted to ARFF format (https://www.cs.waikato.ac.nz/ml/weka/arff.html) and evaluated in 10-fold cross-validation using the Weka [12] implementation of a support vector machine (SVM) trained with Platt's algorithm for sequential minimal optimization [13, 21]. As shown in Table 2, a mean F-score of 0.812 was achieved across all four categories. A mean F-score of 0.846 (with ROC Area of 0.689) was obtained for the binary task of distinguishing non-directives from any of the 3 types of directive sentences.

Table 2: Four-category deontic sentence prediction accuracy.

    Class           P       R       F1      ROC Area
    Absolute        0.784   0.809   0.796   0.934
    Qualified       0.795   0.720   0.755   0.854
    Indirect        0.574   0.301   0.395   0.771
    Non-Directive   0.845   0.898   0.871   0.855
    Weighted Avg.   0.812   0.818   0.812   0.866

This experiment indicates that the deontic categories of relevance to our task can be distinguished by a model trained on a corpus of modest size. We anticipate that this accuracy can be improved by expanding the annotated data set and refining the text extraction and linearization processes that provide input into the classifier.
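For concreteness, the following sketch reproduces the shape of this evaluation with scikit-learn in place of Weka's SMO implementation; the toy sentences, the load_annotated_sentences helper, and the bag-of-n-grams features are illustrative stand-ins for the annotated corpus and the semantic-role feature vectors actually used.

```python
# Illustrative sketch of a four-category deontic sentence classifier evaluated
# by cross-validation; scikit-learn stands in for Weka, and the features and
# data below are simplified placeholders, not ADEPT's actual configuration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def load_annotated_sentences():
    """Placeholder for loading (sentence, label) pairs from the annotated corpus."""
    sentences = [
        "Agencies must establish performance goals.",                                   # absolute
        "Each agency is required to disable 3DES and RC4 ciphers on mail servers.",     # absolute
        "Agencies should establish and report other meaningful performance indicators.",# qualified
        "Senior executives may consider delaying new financial assistance obligations.",# qualified
        "Section 1 of the Executive Order requires agency heads to ensure risk management.",  # indirect
        "The memorandum directs agencies to submit quarterly progress reports.",        # indirect
        "Reports must enumerate performance goals.",                                    # non-directive (detail only)
        "This policy builds on prior efforts to modernize federal cybersecurity.",      # non-directive
    ]
    labels = ["absolute", "absolute", "qualified", "qualified",
              "indirect", "indirect", "non-directive", "non-directive"]
    return sentences, labels

sentences, labels = load_annotated_sentences()

# Linear SVM with Platt scaling (probability=True) over simple lexical features.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    SVC(kernel="linear", probability=True),
)

# The reported evaluation used 10-fold cross-validation; 2 folds here only
# because the toy data set is tiny.
scores = cross_val_score(model, sentences, labels, cv=2, scoring="f1_weighted")
print("mean weighted F1:", scores.mean())
```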
6 SEMANTIC ROLE LABELING AND TEMPLATE INSTANTIATION

For many agency applications, the most useful representation of directives is in the form of structured tables or spreadsheets summarizing multiple sentences. Analysis of representative work products indicated that the information of interest from each sentence includes the following:

• Actor - the agency or office to which the obligation applies
• Activity - the activity that is required of the Actor
• Object - the work product to be produced by the Activity, if any
• Time - any time-related qualification of the directed activity
• Manner - any non-time-related qualification of the directed activity
• Modal - whether the activity is obligatory, permitted, or suggested, as indicated by the particular modal or other verb used to convey the deontic character of the expression, e.g., "must" vs. "may"

For each directive, we instantiate a frame containing argument slots for each of the types of information above. For example, the instantiated frame shown in Table 3 summarizes the key information from the following directive sentence:

(10) Within 60 days of this Memorandum's publication agencies must update their list of non-governmental URLs.

Table 3: An instantiated directive template.

    Actor      agencies
    Activity   update
    Object     list of non-governmental URLs
    Time       within 60 days
    Modal      must

The slots in the directive frame are a domain-specific adaptation of standard semantic roles. We use the semantic role labeling model of AllenNLP [10] to assign PropBank semantic role labels [18] to directive sentences. We then use a set of simple heuristic rules to map these labels to the slots of our frames: e.g., a PropBank "ARG0" is generally the Actor, "ARG1" is generally the Object, and "Temporal" corresponds to the Time slot. Directives expressed without a modal verb ("All agencies are required to ...") have no entry in the "Modal" field.
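The following sketch illustrates how heuristic rules of this kind can map per-token PropBank-style SRL output (the BIO tag format produced by SRL predictors such as AllenNLP's) onto the frame slots above; the specific rules, helper names, and example tags are illustrative, not ADEPT's exact rule set.

```python
# Minimal sketch of mapping PropBank-style SRL tags onto directive-frame slots.
# The role-to-slot table and the hard-coded example are assumptions for
# illustration only.
from collections import defaultdict
from typing import Dict, List

MODAL_VERBS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}

# Heuristic correspondence between PropBank argument labels and frame slots.
ROLE_TO_SLOT = {
    "ARG0": "Actor",
    "ARG1": "Object",
    "ARGM-TMP": "Time",
    "ARGM-MNR": "Manner",
}

def instantiate_frame(tokens: List[str], tags: List[str], verb: str) -> Dict[str, str]:
    """Group tokens by their BIO role label and map roles onto frame slots."""
    spans = defaultdict(list)
    for token, tag in zip(tokens, tags):
        if tag != "O":
            spans[tag.split("-", 1)[1]].append(token)   # strip the B-/I- prefix

    frame = {"Activity": verb}
    for role, slot in ROLE_TO_SLOT.items():
        if role in spans:
            frame[slot] = " ".join(spans[role])
    modal = next((t for t in tokens if t.lower() in MODAL_VERBS), None)
    if modal:                                           # no Modal entry without a modal verb
        frame["Modal"] = modal
    return frame

# SRL output (simplified) for sentence (10).
tokens = ["Within", "60", "days", "agencies", "must", "update",
          "their", "list", "of", "non-governmental", "URLs", "."]
tags = ["B-ARGM-TMP", "I-ARGM-TMP", "I-ARGM-TMP", "B-ARG0", "B-ARGM-MOD", "B-V",
        "B-ARG1", "I-ARG1", "I-ARG1", "I-ARG1", "I-ARG1", "O"]
print(instantiate_frame(tokens, tags, verb="update"))
```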
7 SYSTEM ARCHITECTURE

As illustrated in Figure 2, ADEPT's directive extraction and analysis tasks require a series of processing steps. We have adopted a modular architecture that can accommodate a variety of alternative components.

Figure 2: The directive sentence processing pipeline.

The first stage of the pipeline consists of concurrent calls to the APIs of the Tika and Grobid services offered by their respective Docker [8] containers. Tika outputs the PDF extraction as plain text, whereas Grobid outputs the footnotes embedded in XML. The merge stage integrates this content and outputs a text file consisting of disinterleaved page content followed by all footnotes. The linearizer takes this text as input and outputs a text file containing one linearized sentence per line. The linguistic feature extractor converts each sentence into a feature vector of n-grams and features derived from a dependency parse.

An API call to the Docker container of the AllenNLP service is then made with a JSON file containing all sentences identified as being of the target deontic type or types (e.g., absolute). The AllenNLP output is passed to the template instantiation stage. The final output consists of CSV and HTML files that can be loaded into a spreadsheet or viewed through a web browser.
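A rough sketch of how these stages could be composed is shown below; the service endpoints, the placeholder stage functions, and the keyword-based stand-in for the trained classifier are assumptions for illustration, and the Grobid and merge stages are omitted for brevity.

```python
# Illustrative composition of the pipeline stages in Figure 2. The URLs and the
# linearize/classify functions are placeholders, not ADEPT's implementation.
import csv
import requests

TIKA_URL = "http://localhost:9998/tika"              # assumed local Tika server
SRL_URL = "http://localhost:8000/predict/srl"        # assumed local SRL service

def extract_text(pdf_path: str) -> str:
    """Send the PDF to a Tika server and return extracted plain text."""
    with open(pdf_path, "rb") as f:
        resp = requests.put(TIKA_URL, data=f, headers={"Accept": "text/plain"})
    resp.raise_for_status()
    return resp.text

def linearize(text: str) -> list[str]:
    """Placeholder: yield one flattened sentence per line (see Section 4)."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def classify_deontic(sentences: list[str]) -> list[str]:
    """Placeholder: crude keyword filter standing in for the trained SVM classifier."""
    return [s for s in sentences if any(m in s.lower() for m in ("must", "shall", "required"))]

def label_roles(sentence: str) -> dict:
    """Send one directive sentence to a semantic role labeling service."""
    resp = requests.post(SRL_URL, json={"sentence": sentence})
    resp.raise_for_status()
    return resp.json()

def run_pipeline(pdf_path: str, out_csv: str) -> None:
    sentences = classify_deontic(linearize(extract_text(pdf_path)))
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sentence", "srl"])
        for s in sentences:
            writer.writerow([s, label_roles(s)])
```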
8 DISCUSSION AND FUTURE WORK

ADEPT illustrates how directive extraction, a document analysis task that imposes a significant burden on a wide range of agencies, can be addressed by deontic sentence classification in combination with nested sentence disambiguation and semantic role labeling. We anticipate that an ADEPT directive-extraction pilot will take place in mid-2019 with a representative U.S. federal agency.

Future work will relax ADEPT's current simplifying assumption that the directive content of policy documents can be determined by analyzing individual sentences divorced from their surrounding context. For within-document contextual information, we plan to introduce entity resolution and links connecting sentences that elaborate on an obligation with the obligation sentence to which they apply. To improve cross-document contextual information, we plan to develop techniques to detect and classify references to other documents, particularly statements that the current document rescinds directives from other policy documents.

Automated analysis of policy documents presents a rich set of text-analytic tasks but promises very significant rewards to both agencies and citizens. ADEPT represents an initial realization of this approach to improving the administrative state through modern computational linguistics techniques.

ACKNOWLEDGMENTS

The MITRE Corporation is a not-for-profit company, chartered in the public interest, that operates multiple federally funded research and development centers. This document is approved for Public Release; Distribution Unlimited. Case Number 18-4602.
REFERENCES

[1] L. Allen and C. Saxon. More IA needed in AI: Interpretation assistance for coping with the problem of multiple structural interpretations. In Proceedings of the Third International Conference on Artificial Intelligence and Law, pages 53–61, Oxford, England, June 25–28, 1991.
[2] Apache Tika - a content analysis toolkit. https://tika.apache.org/. Accessed: 2018-11-16.
[3] J. Austin. How to Do Things with Words. Oxford University Press, New York, 1962.
[4] A. Boer and T. van Engers. An agent-based legal knowledge acquisition methodology for agile public administration. In Proceedings of the 13th International Conference on Artificial Intelligence and Law, ICAIL '11, pages 171–180, New York, NY, USA, 2011. ACM.
[5] A. Buabuchachart, K. Metcalf, N. Charness, and L. Morgenstern. Classification of regulatory paragraphs by discourse structure, reference structure, and regulation type. In Proceedings of the 26th International Conference on Legal Knowledge-Based Systems (JURIX), University of Bologna, Bologna, Italy, November 2013.
[6] D. Collarana, T. Heuss, J. Lehmann, I. Lytra, G. Maheshwari, R. Nedelchev, T. Schmidt, and P. Trivedi. A question answering system on regulatory documents. In Proceedings of the 31st International Conference on Legal Knowledge and Information Systems (JURIX), 2018.
[7] E. de Maat, K. Krabben, and R. Winkels. Machine learning versus knowledge based classification of legal texts. In Proceedings of the 2010 Conference on Legal Knowledge and Information Systems: JURIX 2010: The Twenty-Third Annual Conference, pages 87–96, Amsterdam, The Netherlands, 2010. IOS Press.
[8] Docker. https://www.docker.com/. Accessed: 2019-01-24.
[9] M. Dragoni, S. Villata, W. Rizzi, and G. Governatori. Combining NLP approaches for rule extraction from legal documents. In 1st Workshop on MIning and REasoning with Legal texts (MIREL 2016), Sophia Antipolis, France, December 2016.
[10] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. E. Peters, M. Schmitz, and L. Zettlemoyer. AllenNLP: A deep semantic natural language processing platform. CoRR, abs/1803.07640, 2018.
[11] GROBID: GeneRation Of BIbliographic Data. https://grobid.readthedocs.io/en/latest/. Accessed: 2018-12-18.
[12] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 2009.
[13] S. Keerthi, S. Shevade, C. Bhattacharyya, and K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3):637–649, 2001.
[14] M. Koniaris, I. Anagnostopoulos, and Y. Vassiliou. Network analysis in the legal domain: a complex model for European Union legal sources. Journal of Complex Networks, 6(2):243–268, 2018.
[15] A. Marasović and A. Frank. Multilingual modal sense classification using a convolutional neural network. In P. Blunsom, K. Cho, S. B. Cohen, E. Grefenstette, K. M. Hermann, L. Rimell, J. Weston, and S. W. Yih, editors, Proceedings of the 1st Workshop on Representation Learning for NLP, Rep4NLP@ACL 2016, Berlin, Germany, August 11, 2016, pages 111–120. Association for Computational Linguistics, 2016.
[16] L. Morgenstern. Toward automated international law compliance monitoring (TAILCM). Technical report, Leidos, Inc., 2014. AFRL-RI-RS-TR-2014-206.
[17] J. O'Neill, P. Buitelaar, C. Robin, and L. O'Brien. Classifying sentential modality in legal language: a use case in financial regulations, acts and directives. In Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law, ICAIL 2017, London, United Kingdom, June 12–16, 2017, pages 159–168, 2017.
[18] M. Palmer, D. Gildea, and P. Kingsbury. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106, March 2005.
[19] W. Peters and A. Z. Wyner. Legal text interpretation: Identifying Hohfeldian relations from text. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May 23–28, 2016. European Language Resources Association (ELRA), 2016.
[20] The Plain Writing Act of 2010, 2010. 111th Congress, H.R. 946.
[21] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods, pages 185–208. MIT Press, Cambridge, MA, USA, 1999.
[22] J. Searle. Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press, Cambridge, 1969.
[23] Tesseract OCR. https://opensource.google.com/projects/tesseract. Accessed: 2018-11-16.
[24] A. Wyner and W. Peters. On rule extraction from regulations. Frontiers in Artificial Intelligence and Applications, (235), January 2011.

9 APPENDIX: LINEARIZATION OF NESTED DIRECTIVES

FOR each document ingested by the linearizer:
    Preprocess: Remove footnotes to prevent splitting of enumerated list elements or main body sentences during downstream processing later in the classification pipeline
        EXTRACT strings matching footnote format
        STORE matching strings in References array
        DELETE matching strings in their original positions
        DELETE all multiple (n-1) vertical and horizontal spacing
    Detect Document Section Boundaries: Identify the position of each document section to prevent enumerated elements from spanning multiple distinct lists
        MATCH list of known section headers
        STORE matches in partition, along with starting offset position for each section, in index
        READ any enumerated lists in between section boundaries
    Parse and Concatenate Enumerations: Map document hierarchical enumeration conventions against different symbol sets; concatenate all directly subordinated sentence fragments with their subordinating fragments to form full (flat) sentences from the enumerated elements for downstream processing later in the classification pipeline
        MATCH lines in each enumerated list within each section against enumeration symbol style list delimited by punctuation cues (Uppercase Roman Numerals, Lowercase Roman Numerals, Uppercase Letters, Lowercase Letters, Number Digits, Solid Bullet Points, Hollow Bullet Points)
        STORE the sequential order (i.e., layers) of enumeration styles encountered to set document convention, where each layer begins with its own closed set of enumeration symbols
        FOR lower-order layers
            CONCATENATE lines recursively with all parent layers
            TERMINATE upon reaching a new paragraph with no enumeration symbol at the start of the line
        ITERATE over all sections
        WRITE to [FILENAME]_paths.txt file
    Standardize Global Enumeration: Rewrite enumeration conventions to a standard format (e.g., I.iii.B.a. → 1.3.2.1.)
        FOR all enumerated lists
            REWRITE each line's enumeration symbol with its corresponding digit based on the layer order and within-layer order
        WRITE to [FILENAME]_trees.txt file
    Post-Process Footnotes: Add previously extracted footnotes to the bottom of the document
        APPEND footnote elements to bottom of the [FILENAME]_paths.txt file under the new section header "Footnotes"
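As an illustration of the Standardize Global Enumeration step, the following sketch detects each line's enumeration style, records the order in which styles first appear as the document's layer convention, and rewrites each symbol as a dotted numeric path (e.g., I.iii.B.a. → 1.3.2.1.); the regexes, style inventory, and sample text are assumptions rather than ADEPT's implementation, and single letters such as "I" remain ambiguous between Roman-numeral and letter styles without document-level context.

```python
# Illustrative enumeration-style detection and standardization.
import re

STYLES = [  # (style name, regex for the leading enumeration token)
    ("upper_roman", re.compile(r"^\(?([IVXLC]+)[.)]\s+")),
    ("lower_roman", re.compile(r"^\(?([ivxlc]+)[.)]\s+")),
    ("upper_letter", re.compile(r"^\(?([A-Z])[.)]\s+")),
    ("lower_letter", re.compile(r"^\(?([a-z])[.)]\s+")),
    ("digit", re.compile(r"^\(?(\d+)[.)]\s+")),
    ("bullet", re.compile(r"^([•○o\-\*])\s+")),
]

def detect_style(line):
    for name, pattern in STYLES:
        match = pattern.match(line.strip())
        if match:
            return name, match
    return None, None

def standardize(lines):
    layer_order = []   # styles in order of first appearance = assumed nesting depth
    counters = {}      # per-style ordinal counter within the current list
    out = []
    for line in lines:
        style, match = detect_style(line)
        if style is None:
            out.append(line)
            counters.clear()            # a plain paragraph terminates the list
            continue
        if style not in layer_order:
            layer_order.append(style)
        depth = layer_order.index(style)
        counters[style] = counters.get(style, 0) + 1
        for deeper in layer_order[depth + 1:]:
            counters.pop(deeper, None)  # reset counters of more deeply nested layers
        prefix = ".".join(str(counters.get(s, 1)) for s in layer_order[:depth + 1])
        out.append(prefix + ". " + line.strip()[match.end():])
    return out

sample = ["I. Enhance email security by:",
          "   a. Within 90 days, configuring all mail servers to offer STARTTLS;",
          "   b. Within 120 days, ensuring:",
          "      i. Weak cipher suites are disabled on mail servers.",
          "II. Enhance web security by:"]
print("\n".join(standardize(sample)))
```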