Automated Methods for Extracting and Expanding Lists in Regulatory Text

Alan BUABUCHACHART, Nina CHARNESS, Katherine METCALF, and Leora MORGENSTERN
Leidos, 4001 N. Fairfax Drive, Arlington, VA 22203

Abstract. This is a highly condensed version of a paper [1] that presents automated methods for accurately transforming regulations that contain bulleted lists into sets of complete sentences that include their proper context. We discuss the technical challenges addressed, including extracting the intended structure from HTML documents and correctly distributing preambles over nested text. We have used this work to preprocess the corpus for our experiments in classifying paragraphs in regulatory documents by several categories, including illocutionary point, regulation type, and reference structure. That work is presented in a companion paper published in the JURIX 2013 proceedings.

Keywords. text analysis, regulation, preprocessing, classification

1. Introduction

Many regulatory documents contain, or are largely composed of, bulleted lists. Such lists break up complex text and increase comprehension for human readers, who are good at distributing preambles over bulleted text as they read. Automated processing of such documents is difficult, however, especially if the bulleted units are not complete sentences and there are multiple levels of nesting. We develop automated methods to transform text that contains bullets into text in which the bulleted text is fully expanded, with preambles distributed over the bulleted text.

This work is a preliminary step in our study of the feasibility of automating the translation of regulatory text into formal, executable rules. Our approach to this general problem involves both machine learning and deep parsing techniques; we have found that distribution is a necessary first step for both tasks.
As discussed in [2], both the consistency of annotation/training data and the performance of clustering algorithms are superior when using expanded text rather than standard bulleted text, or bulleted text to which sentence splitting techniques [3] have been applied. Moreover, the loss of context inherent in sentence splitting suggests that expanded text will lead to more accurate parsing.

2. Motivating Example: The importance of context in reading bulleted lists

Domain and corpus: We are working with a corpus of 250 United States financial regulation units. Consider, e.g., the initial fragment of FINRA Rule 3240:

3240. Borrowing From or Lending to Customers
(a) Permissible Lending Arrangements; Conditions
No person associated with a member in any registered capacity may borrow money from or lend money to any customer of such person unless:
    (1) the member has written procedures allowing the borrowing and lending of money between such registered persons and customers of the member;
    (2) the borrowing or lending arrangement meets one of the following conditions:
        (A) the customer is a member of such person’s immediate family;
        (B) the customer (i) is a financial institution regularly engaged in the business of providing credit ... and (ii) is acting in the course of such business;
        (C) the customer and the registered person are both registered persons of the same member;
        .....

Bulleted structure aids human comprehension by breaking up text. We understand that bullet (a) lists ways in which lending is allowed, and that subbullet (2) specifies alternative necessary conditions constraining the relationship between customer and lender. As we read the text, we must keep context in mind.

[3] and [4] advocate processing bulleted text by using punctuation cues to do sentence splitting. This yields sentences such as "the member has written procedures allowing the borrowing and lending of money between such registered persons and customers of the member."
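To make the contrast concrete, here is a minimal sketch of what distributing preambles over the nested bullets of Rule 3240 could look like. It is an illustration under stated assumptions, not the authors' implementation: the `Node` type and `expand` function are hypothetical names, the expansion is plain concatenation of each ancestor's text with the leaf bullet, and bullet (B) is left out because its text is elided in the excerpt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A preamble or bullet (hypothetical type); children are its nested bullets."""
    text: str
    children: List["Node"] = field(default_factory=list)


def expand(node: Node, context: str = "") -> List[str]:
    """Return one sentence per leaf, each prefixed by all ancestor preambles."""
    combined = f"{context} {node.text}".strip()
    if not node.children:
        # Leaf bullet: replace trailing list punctuation with a full stop.
        return [combined.rstrip(" ;:,") + "."]
    sentences: List[str] = []
    for child in node.children:
        # The accumulated preamble becomes the context of every child.
        sentences.extend(expand(child, combined))
    return sentences


# Fragment of FINRA Rule 3240(a); bullet labels are dropped (an assumption).
rule = Node(
    "No person associated with a member in any registered capacity may "
    "borrow money from or lend money to any customer of such person unless:",
    [
        Node("the member has written procedures allowing the borrowing and "
             "lending of money between such registered persons and customers "
             "of the member;"),
        Node("the borrowing or lending arrangement meets one of the "
             "following conditions:", [
                 Node("the customer is a member of such person's immediate "
                      "family;"),
                 Node("the customer and the registered person are both "
                      "registered persons of the same member;"),
             ]),
    ],
)

sentences = expand(rule)
for sentence in sentences:
    print(sentence)
```

Each of the three resulting sentences begins with the full prohibition, so it can be read in isolation. Sentence splitting alone would instead emit bare fragments such as the one quoted above.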
Such sentences are missing context and are therefore difficult to understand.

3. Extracting from HTML, Tree Building, Distributing Preambles

In developing our technical approach, we address two hard problems: (1) extracting bulleted structure from available text; (2) building a tree structure that supports expansion and distribution of parent preambles over child bullets for arbitrarily deep levels of nesting. We can then traverse the tree to obtain the distributed text.

We recovered bulleted structure using HTML files from six different online law sources. Utilities like jsoup facilitate detection of paragraphs and indentation, and HTML tags make it easier to remove junk text. Unfortunately, no source HTML files use standard bulleting tags (e.g.,