Segmentation of Rulemaking Documents for Public Notice-and-Comment Process Analysis Anna Belova∗ Matthias Grabmair Eric Nyberg abelova@alumni.cmu.edu mgrabmai@andrew.cmu.edu ehn@cs.cmu.edu Carnegie Mellon University Carnegie Mellon University Carnegie Mellon University Pittsburgh, PA Pittsburgh, PA Pittsburgh, PA ABSTRACT The online forum for the US public notice-and-comment process— We evaluate feasibility of automated identification of comment regulations.gov—was launched in January 2003, as part of the US discussion passages and comment-driven proposed rule revisions eRulemaking program established as a cross-agency E-Gov initia- in the US Environmental Protection Agency’s (EPA’s) rulemaking tive under Section 206 of the 2002 E-Government Act (H.R. 2458/S. documents. We have annotated a dataset of final rule documents to 803). In this collection, all documents pertaining to the develop- identify all spans in which EPA discusses and evaluates the merits ment of a particular rule are compiled in a regulatory docket. A of public comments received on its proposed rules, and present typical docket contains a proposed rule document, many public lessons learned from the annotation process. We implement several comment documents, and a final/revised rule document.1 As such, baseline supervised discourse segmentation models that combine regulations.gov provides a testbed for study of the public notice- classic linear learners with sentence representations using hand- and-comment discourse in the US. crafted features as well as Bidirectional Encoder Representations In this work, we focus on (1) identifying spans in the final rule from Transformers (BERT). We observe good agreement on anno- documents that contain the agency’s discussion of the public com- tation comment discussions and our models achieve a classification ments it received, and (2) classifying those spans as being either F1 of 0.73. Public comment dismissals and rule revisions are substan- dismissals of the commenter claims or revisions of the proposed tially harder to annotate and predict, leading to lower agreement regulations prompted by the comment. In that, we analyze 353 US and model performance. Our work contributes a dataset and a base- Environmental Protection Agency (EPA) regulations proposed in line for a novel discourse segmentation task of identifying public January 2003 or later, and finalized as of March 2018.2 comment discussion and evaluation by the receiving agency. Our work contributes a dataset3 and a baseline for a novel dis- course segmentation task of identifying public comment discussion and evaluation by the receiving agency. Automatic detection of 1 INTRODUCTION comment discussion passages in the rulemaking documents could Government agencies are created by the legislatures worldwide to improve the efficiency of regulatory review conducted by experts at regulate social, economic, and political aspects of people’s lives. a number of organizations, including the US Office of Information These agencies belong to the executive branch of the government, and Regulatory Affairs, regulatory agencies, and other stakeholders yet they create legally enforceable regulations and rules that im- of the regulatory process. In addition, segmentation of regulatory plement broad legislation. 
In the US, public notice-and-comment discourse is the first step bringing agency’s narrative deliberations processes have become an important venue for influencing social in the study of bureaucratic politics and decision making (e.g., reg- and economic policy. In that, US agencies publish proposed rules ulatory capture theory) by economists and political scientists [37], in the Federal Register (FR) and all interested parties are given which to date has relied on structured data generated by surveys an opportunity to comment. Agency regulatory proposals receive and administrative record-keeping (e.g. permitting, inspections). public feedback from individuals, businesses, organized groups (of individuals or businesses), and other agencies. Comments represent 2 RELATED WORK heterogeneous interests in particular regulatory outcomes. The In the peer-reviewed literature, discussion of e-rulemaking benefits, agency is not obliged to react to each individual received comment. challenges, and related artificial intelligence (AI) methods began However, it has to respond to comments that raise significant issues in the early 2000s [9]. Over a decade later, surveys by [7] and [37] with the proposed rule and, if the points raised have merit, may describe several e-rulemaking initiatives that involved successful substantively revise of the rulemaking document. The final rule applications of AI. One line of e-rulemaking research has focused document is published in the FR and contains the discussion of on tasks relevant to management of massive amount of public com- submitted comments, or points to other documents in the docket ments received by agencies (e.g., [58], [31], [52]). Another line of that address concerns raised in the comments. ∗ Corresponding author 1 Other documents, such as transcripts of public hearings, technical support documents, detailed comment response documents, copies of pertinent scientific papers, e-mails and other correspondence, may also be included. Finally, a docket may also contain In: Proceedings of the Workshop on Artificial Intelligence and the Administrative State tabular data and software source code used to produce analytical results. (AIAS 2019), June 17, 2019, Montreal, QC, Canada. 2 We have chosen to focus on EPA because this agency published the most rules (∼ 20% © 2019 Copyright for this paper by its authors. Use permitted under Creative Commons of all rule documents) and received the most comment submissions (∼ 10% of all License Attribution 4.0 International (CC BY 4.0). comment documents) in regulations.gov during the studied time period. 3 The data and code are available at https://github.com/mug31416/PubAdmin- Published at http://ceur-ws.org Discourse.git Conference’17, July 2017, Washington, DC, USA Belova, Grabmair, and Nyberg research, conducted as part of Cornell University’s RegulationRoom 3 DATA project, has focused on tools to improve the quality of public dis- 3.1 Rule-Making Documents course around rulemaking (e.g., [41], [46]). Research on the text of rules developed by agencies has mostly focused on the search We work with the EPA’s final rule documents that are part of the for similar rules in the FR [33], rather than segmentation of the FR. Along with a summary, each of our documents can contain one comment-related discourse in the rule documents. 
or more of the following sections: regulatory background, scope of Prior to launch of regulations.gov, work on e-rulemaking the regulation, rationale for action, technical material describing used several rule-specific comment collections that were either the regulatory requirements, responses to public comments on the shared by the agencies—EPA, Fish and Wildlife Service (FWS)—or proposed regulation, statutory and executive order review, and gathered as part of the RegulationRoom experiments in collabora- legal references. We are interested in automated identification of all tion with the US Department of Transportation (DOT). The tasks passages where the agency discusses public comments, which could have included near duplicate detection to address mass comment occur throughout the document and are not necessarily confined campaigns [58], comment topic modeling [5, 8, 30, 51, 59], stake- to the comment response section. holder attitude identification [1, 31], and presence of substantive We note that the structure of the final rule documents can vary points in public comments [2, 44, 45, 57]. The RegulationRoom significantly depending on whether it has been produced by the project has generated a number or papers on argument mining EPA headquarters or a regional office, as well as depending on the and conflict detection within comments [29, 34, 43]. These research specific EPA office (e.g., Office of Water, Office of Air and Radiation). efforts have focused on examining only a few regulatory proceed- For example, rule documents produced by the headquarters offices ings at a time, whereas we evaluate a signifcantly larger dataset are usually major federal regulations that tend to be long and receive containing hundreds of rule documents. significant public feedback. On the other hand, rule documents More recent work on e-regulation has analyzed public comment produced by regional offices tend to be shorter.4 data collected by regulations.gov [13, 14, 35, 37, 50, 52], rule- It should be noted that our dataset only contains final rule docu- specific data from the Canadian government [53], and data from the ments as published in the FR. It does not include submitted comment White House e-petition platform [15, 19–21]. The tasks addressed documents, technical support documents, or detailed, dedicated in this body of work are topic modeling [15, 20, 21, 35, 37, 52, 53], comment response documents that are part of the docket but extra- sentiment analysis [13, 14, 37, 50], named entity recognition [20], neous to the register. and social network analysis [19]. 3.1.1 Task 1: Detecting Comment Discussions. In the first task, we Segmentation of text into discourse units [38] is a core natural want to identify the spans in the document where the EPA discusses language task. Many downstream tasks, such as information ex- submitted public comments. Examples of a comment discussion traction [27], sentiment analysis [3], information retrieval [16], and include: summarization [4, 36], can benefit from discourse segmentation. • Descriptions of comments received by the agency. For exam- Because lexical and syntactic text properties form important dis- ple, “EPA received comments suggesting that the definition course clues [6], many segmentation methods rely on hand-crafted of clean alternative fuel conversion should be limited to a features to capture them [17, 26]. 
Classic learning frameworks that group of fuels with proven emission benefits.”; have been used for discourse segmentation are linear Support Vec- • Descriptions of the agency’s responses to the comments it tor Machines (SVM) [11] and linear-chain Conditional Random receives. For example, “ EPA believes however that the public Fields (CRF) [32]. interest is better served by a broader definition that allows for One of the key challenges in discourse segmentation develop- future introduction of innovative and as-yet unknown fuel ment is the dearth of annotated data, which, until recently, pre- conversion systems. EPA is therefore finalizing the proposed vented the use of neural architectures. Effective neural discourse definition of clean alternative fuel conversion...”. segmentation methods [22, 56] have relied on word representations obtained from an external neural model trained to perform a re- By distinction, we are not interested in: lated task using a large corpus [39, 49]. The state-of-the art neural • Summarized feedback from petitions (as opposed to public discourse segmentation framework [18, 56] has employed a Bidirec- comments) to the agency; tional Long-Short-Term Memory-CRF architecture (BiLSTM-CRF) • Descriptions of the public comments on another rule; [25] with an attention mechanism [55]. • Statements such as “we received no comments”; For our baseline model development, we have combined several • Passages discussing revisions of a regulatory standard rather classic learning methods with hand-crafted, as well as neural sen- than revisions of the proposed rule; tence representations, from Bidirectional Encoder Representations • Referrals to another document in the docket with detailed from Transformers (BERT) [12], which were trained on English responses to comments. Wikipedia (2,500 million words) and BooksCorpus (800 million 3.1.2 Task 2: Classification of Comment Merit. In the second task, words) [60] using masked language and next sentence prediction we want to classify each comment discussion span as to whether objectives. BERT representations have demonstrated to perform the discussed comment prompted a change in the final rule from well on a wide range of natural language processing tasks. We also the proposed rule. As such, we are considering three categories: explore whether fine-tuning of BERT on the unlabeled documents in our corpus improves performance. 4 With the possible exception of the regional air quality rules that still tend to attract considerable public attention Segmentation of Rulemaking Documents Conference’17, July 2017, Washington, DC, USA passages in which the agency indicates a revision of the rule based Without context, it is unclear whether this sentence has anything on a public comment, passages in which the agency dismisses to do with comments at all, let alone whether required vs. optional a comment, and neutral comment discussion passages (i.e., the compliance results in it agreeing with, or dismissing, the comment’s passages in which the agency neither dismisses the comment nor arguments. indicates a revision). Examples of formulations reflecting comment-based regulatory 3.2 Acquisition and Sampling change are rule revisions and rule withdrawals: We have created our corpus from regulations.gov data by se- • “To address concerns about space limitations, EPA will allow lecting EPA regulatory dockets for rules proposed in January 2003 the label information to be logically split between two labels or later and finalized as of March 2018. 
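As an illustration of how fixed BERT sentence representations of the kind used here can be obtained, the following Python sketch uses the pretrained bert-base-uncased model via the pytorch-pretrained-BERT package cited later in footnote 14; the truncation to 512 WordPieces and the mean-pooling of the final hidden layer are assumptions made for exposition, since the paper does not prescribe a pooling strategy.

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def embed_sentence(text):
    # WordPiece-tokenize, add the special markers, and truncate to BERT's 512-token limit.
    tokens = ["[CLS]"] + tokenizer.tokenize(text)[:510] + ["[SEP]"]
    ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
    with torch.no_grad():
        encoded_layers, _pooled = bert(ids)  # list of hidden layers plus the pooled [CLS] vector
    # Mean-pool the last hidden layer into one fixed-size sentence vector
    # (this pooling choice is an assumption of the sketch, not the authors' stated method).
    return encoded_layers[-1].squeeze(0).mean(dim=0).numpy()

vector = embed_sentence("EPA received comments suggesting that the definition be narrowed.")

As in the paper, the vector of a sentence and that of its preceding sentence can then be concatenated to form the final feature vector.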
Our selection has been that are both placed as close as possible to the original Vehicle constrained to dockets containing at least one proposed rule docu- Emission Control Information (VECI) or engine label.” ment, at least one final rule document, and at least one comment • “EPA agrees and is including use of this procedure in the OBD document. Our corpus contains 1,566 EPA dockets (meta-data 8.8 demonstration requirement for intermediate age vehicles.” MB), 2,645 final rule documents (HTML, 376 MB), 2,531 proposed • “The EPA has reviewed the new data submitted by the com- rule documents (HTML, 400 MB), and 282,655 comment documents menter and used these data to determine the revised MACT (85% PDF, 36 GB; 15% plain text, 836 MB). floor for continuous process vents at existing sources.” For the purposes of exhaustive rule document annotation, we • “EPA received one adverse comment from a single Com- have used stratified random sampling at the docket level to select menter on the aforementioned rule. As a result of the com- two development docket sets (dev1 and dev2) and one test docket ment received, EPA is withdrawing the direct final rule ap- set. The sampling procedure has ensured that the docket sets are a proving the aforementioned changes to the Alabama SIPs.” representative mix of EPA program offices and regions.5 As such, Examples of comment dismissals without a subsequent regula- we have obtained 75 dev1-set dockets (116 documents), 76 dev2-set tory change are: dockets (136 documents), and 73 test-set dockets (99 documents). In our qualitative examination of the regulatory documents, we • “We disagree that our action to approve California’s mobile have found that the section headers of the rule documents are source regulations that have been waived or authorized by often informative about whether a section contains a discussion of the EPA under CAA section 209 is inconsistent with the public comments. To make use of this additional information, we Ninth Circuit’s decision...” have applied the same random sampling procedure to the remaining • “EPA is finalizing the conversion manufacturer definition as dockets to obtain 211 training dockets (817 training documents) and proposed.” 103 validation dockets (197 validation documents) for the section • “While we agree with the commenter that pressure release header annotation. from a PRD constitutes a violation, we will address this in a separate rulemaking...” • “In the final rule we will clarify our position...” 3.3 Preprocessing • “EPA appreciates support from the commenters for this ini- The rule documents were processed in two steps. First, we have tiative and agrees that the rule makes it possible for EPA to applied a rule-based rule document parsing procedure to delete process the TRI data more quickly.” tables, split the text into sections, and retrieve section titles of • “EPA believes that no further response to the comment is the first and second level super-sections. This procedure exploits necessary...” the regular structure of documents to create heuristics applicable We observe that this task requires considerably more complex to roughly 90% of documents.6 When exceptions to the standard inference, potentially spanning multiple sections of the document. structure are detected, we manually fixed irregularities to enable As seen in the examples above, comment dismissals range from automatic parsing. Second, the section text has been split into very obvious to rather subtle. 
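A minimal sketch of the docket-level stratified sampling described above is given below; the stratum labels (program office/region combinations), the dictionary schema, and the equal split fractions are illustrative assumptions rather than the authors' actual procedure.

import random
from collections import defaultdict

def split_dockets(dockets, fractions=(1 / 3, 1 / 3, 1 / 3), seed=0):
    """Docket-level stratified split into dev1/dev2/test sets.

    `dockets` is a list of dicts with hypothetical keys 'docket_id' and 'stratum'
    (e.g. an office/region label such as 'OW/HQ' or 'OAR/Region 1').
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for d in dockets:
        by_stratum[d["stratum"]].append(d["docket_id"])
    splits = {"dev1": [], "dev2": [], "test": []}
    for ids in by_stratum.values():
        rng.shuffle(ids)
        n = len(ids)
        c1 = round(n * fractions[0])
        c2 = c1 + round(n * fractions[1])
        splits["dev1"].extend(ids[:c1])
        splits["dev2"].extend(ids[c1:c2])
        splits["test"].extend(ids[c2:])
    return splits

Sampling at the docket level (rather than the document level) keeps all documents of a docket in the same split, which avoids leakage between development and test sets.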
In turn, determinations of whether sentences, tokenized, and lemmatized using SpaCy [24]7 . a rule was materially revised based on the public comments may also require a clear understanding of what was proposed in the first 3.4 Annotation place. 3.4.1 Rule Documents. We hired ten students from Carnegie Mel- An extreme example of this can be seen from the following lon University and the University of Pittsburgh to perform the comment dismissal sentence: annotation tasks during the period of February 2019–April 2019. “Certain aspects of good engineering judgment described in the All annotators are at least second year undergraduate students. Five exhaust control system, evaporate control system, and fuel delivery of the annotators are masters students in fields including computer control system sections may be approached differently than described science, public health, product management, and international rela- above, but EPA expects that test data demonstrating compliance is tions. The other five are undergraduate students in civil engineering, required rather than optional in such cases.” creative writing, business, and human computer interaction. The sentence responds to technical objections to a regulation 5 For example, Office of Water/Headquarters, Office of Air and Radiation/Region 1 – by conceding that alternatives are valid (“may be approached dif- ferently”) but goes on to state the substantive decision in domain Boston. 6 For example, the first and the second level sections are numbered consecutively in terminology (“compliance is required rather than optional”, suggest- Roman numbers and Latin letters, respectively. ing that the comment had advocated for the “optional” alternative). 7 Version 2.0.18 (model en_core_web_sm) Conference’17, July 2017, Washington, DC, USA Belova, Grabmair, and Nyberg The annotators were trained to perform the two tasks described 4.1 Handcrafted Features in Section 3.1.1 and Section 3.1.2. For the first task, each annotator For sentence representation we concatenate three categories of received an hour-long in-person training as well as individualized handcrafted features. First, we featurized the text of the sentence feedback on a set of four training documents. For the second task, for which the prediction needs to be made, as well as the text of the the guidelines were delivered via a video. Each annotator received preceding sentence, and concatenate the feature vectors. We use 50 documents on average, including reliability annotations. The original tokens (including stop words, but excluding punctuation), documents were allocated such that each annotator worked on a modified tokens with attached POS tags, bigrams of modified tokens, balanced mix of documents from different EPA offices, regions, and and bigrams of POS tags.13 We apply feature hashing [40] to reduce dev1/dev2/test set dockets. The annotations were performed using dimensionality. This results in a feature set of size 2,001. an online tool developed by a collaborating group at the University Second, we featurized the text of the section header containing of Pittsburgh called Gloss. the sentence in question. In that, we apply the same feature genera- Finally, we note that some annotators did not complete all as- tion process used for sentences to the text of the sentence-bearing signments for the segmentation task, leading to some redistribution section header and the header that precedes it. The dimension of of work. 
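To make the preprocessing of Section 3.3 and the feature construction of Section 4.1 concrete, the following sketch combines spaCy sentence splitting, tokenization, lemmatization, and POS tagging with hashed lexical features; the hashing dimension and the joint hashing of current- and preceding-sentence features are simplifications of the setup reported in the paper (which concatenates separately hashed vectors).

import spacy
from sklearn.feature_extraction import FeatureHasher

nlp = spacy.load("en_core_web_sm")  # the paper uses spaCy 2.0.18 with this model

def section_to_sentences(section_text):
    """Split a section into sentences and keep tokens, lemmas, and POS tags."""
    doc = nlp(section_text)
    sentences = []
    for sent in doc.sents:
        toks = [t for t in sent if not t.is_punct]
        sentences.append({
            "text": sent.text,
            "tokens": [t.text.lower() for t in toks],
            "lemmas": [t.lemma_.lower() for t in toks],
            "pos": [t.pos_ for t in toks],
        })
    return sentences

def lexical_features(sent):
    """Tokens, token+POS, token bigrams, and POS bigrams as string features."""
    toks, pos = sent["tokens"], sent["pos"]
    feats = list(toks)
    feats += ["%s_%s" % (t, p) for t, p in zip(toks, pos)]
    feats += ["%s_%s" % (a, b) for a, b in zip(toks, toks[1:])]
    feats += ["POS_%s_%s" % (a, b) for a, b in zip(pos, pos[1:])]
    return feats

hasher = FeatureHasher(n_features=2000, input_type="string")  # dimension is illustrative

def featurize_section(section_text):
    sentences = section_to_sentences(section_text)
    rows = []
    for i, sent in enumerate(sentences):
        feats = lexical_features(sent)
        if i > 0:  # context from the preceding sentence
            feats += ["PREV_" + f for f in lexical_features(sentences[i - 1])]
        rows.append(feats)
    return hasher.transform(rows)  # sparse matrix with one row per sentence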
The comment response classification task was completed this feature set is 101. by eight annotators of the initial ten annotators. Third, we also add a binary flag equal to one if a header of the section in which the sentence occurs has been predicted to con- 3.4.2 Section Headers. Annotation of the section headers was per- tain a comment discussion. We generate these predictions through formed by a sole expert annotator (the first author). To this end, instance-based learning on the unique section headers from the all unique section titles were extracted along with three samples training set of dockets set aside for this purpose (see Section 3.4.2). of the first paragraph following the section title. These examples Based on the unique headers from the associated validation docket are used to judge whether a section contains comment discussion: set, this signal mining procedure has a recall of 0.54 and a precision If all three sample paragraphs include comment discussions, the of 0.88. section title is flagged as the comment-discussion-indicative title.8 4.2 Neural Features 4 METHODS We employ BERT[12] to create embedded vector representations To generate baseline results, we use a classic linear SVM9 and for sentences and section headers. BERT is a state of the art neural linear-chain CRF10 learners to segment the rule documents into network language model trained on a large collection of English text spans that contain public comment discussion and merit evalu- in a quasi-unsupervised fashion by having it learn to predict masked ation by the agency.11 The benefit of the CRF over the SVM is words in a sentence, or to classify whether one sentence follows that, when predicting a sentence label, it takes into account the another, or not. By doing so, BERT learns to maintain a neural label of the prior and subsequent sentence in addition to the focal representation of language context. These vector representations sentence’s feature vector. In addition, to understand the impact of of English text can then be used as for various natural language incorporating feature interactions, we conduct experiments with processing tasks and have been shown to yield significantly better the Multi-Layer-Perceptron (MLP)[23].12 performance than context-independent word embeddings. We estimate three binary sentence-level models predicting whether As in case of the hand-crafted features, we concatenate both the a given sentence contains: (i) a public comment discussion, (ii) a vectors of the sentence/header in question as well as the context dismissal of a public comment by the agency, and (iii) an agency represented by the preceding sentence/header to form a final fea- decision to revise the proposed rule based on the public comments. ture vector. We explore performance of the available pretrained For the CRF modeling, a training instance is a sequence of sen- BERT model as well as a BERT model that has been fine-tuned on tences within the rule document section boundaries. To address the approximately 6,000 rule documents from our corpus that have label sparsity for the comment dismissal/revision classification, we not been included in the annotated document sets. To this end, we explore the utility of training models only on data that is known to rely on a PyTorch[47] implementation of BERT.14 The size of the contain comment discussion (i.e. on the non-ignorable sentences) generated sentence/header embedding is 728. 
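The two baseline learners described in Section 4 can be instantiated roughly as follows, using the hyperparameters stated in footnotes 9 and 10 (linear SVC with C=1 and at most 1,500 iterations; 1-slack structural SVM with a chain CRF, C=0.1, and at most 1,500 iterations). The toy inputs only illustrate that each CRF training instance is one document section, i.e. a sequence of sentence feature vectors and their labels; variable names are assumptions.

import numpy as np
from sklearn.svm import SVC
from pystruct.models import ChainCRF
from pystruct.learners import OneSlackSSVM

# Sentence-level linear SVM (footnote 9 settings).
svm = SVC(kernel="linear", C=1.0, max_iter=1500)
# svm.fit(X_sentences, y_sentences); svm.decision_function(X_test) yields ranking scores.

# Linear-chain CRF over the sentences of one section (footnote 10 settings),
# so the labels of neighboring sentences are modeled jointly.
crf = OneSlackSSVM(model=ChainCRF(), C=0.1, max_iter=1500)

# Toy inputs, only to show the expected shapes:
X_sections = [np.random.rand(5, 20), np.random.rand(3, 20)]
y_sections = [np.array([0, 1, 1, 0, 0]), np.array([0, 0, 1])]
crf.fit(X_sections, y_sections)
predicted = crf.predict([np.random.rand(4, 20)])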
The fine-tuned model and then composing a two tiered model to first detect comment was trained for seven epochs. discussions, and then then classify their polarity. The hyperparam- eters have been tuned by fitting the models to the dev1-set and 5 EVALUATION evaluating results on the dev2-set. We evaluate the quality of the rule document annotation using Co- hen’s kappa coefficient [10], as well as qualitatively. Performance of 8 For example, there were several first level section titles “What comments did EPA our baseline text segmentation models is evaluated on the test set at receive?”. 9 We use scikit-learn version 0.20.2 SVC implementation [48] with an error term penalty the sentence level using area under the ROC curve (AUC), F1-score, parameter of 1, and 1,500 as the maximum number of iterations. precision, and recall. We found a sentence to be the most meaning- 10 We use PyStruct 0.3.2 implementation [42] of margin re-scaled structural SVM ful operational definition of a passage, because comment-discussing using the 1-slack formulation and cutting plane method [28]. We used regularization parameter of 0.1 and 1,500 as the maximum number of iterations. 13 We do not use a TFIDF feature representation because it has not performed as well 11 We have been unable to fit kernelized polynomial and RBF SVMs to our data because as a simple count-based featurizer in our preliminary experiments. these methods do not scale well to the size of our dataset. 14 PyTorch Pretrained BERT: The Big and Extending Repository of pretrained Trans- 12 We use a scikit-learn version 0.20.2 MLP implementation [48] with one hidden layer formers from https://github.com/huggingface/pytorch-pretrained-BERT. We used the of 100 units optimized for at most 100 epochs at the default settings. bert-base-uncased version of the model. Segmentation of Rulemaking Documents Conference’17, July 2017, Washington, DC, USA sentences are often interspersed with ignorable sentences of a sec- Table 1: Characteristics of the Annotated Data tion or a paragraph. For each model, the classification cutoff has been determined using a threshold that maximizes the F1-score on Characteristic Dev1- Dev2- Test- the training data. set set set Number of the Data Set Elements 6 RESULTS Dockets 75 76 73 6.1 Annotation Documents 116 136 99 Table 1 summarizes the key properties of the annotated dataset. Sections 2,197 2,123 1,766 For this summary, we have converted span-level annotations into Sentences 72,969 61,837 61,042 sentence-level annotations. To this end, we have assigned a label to Words 1,820,619 1,583,518 1,430,134 a sentence if an annotator has marked 80% of tokens that make up Number of the Annotated Sentences that sentence. For documents that have been annotated by multiple Non-ignorable content 19,465 20,105 12,979 individuals, we assign a label to a sentence if at least one individual Comment dismissals 3,527 3,225 2,202 has labeled the sentence. This approach has been motivated by a Comment-based regulatory change 2,092 1,015 1,088 qualitative examination of annotations, which revealed low recall is- sues for some annotators. Depending on the dataset, non-ignorable Number of the Double-Annotated Sentences content (i.e. text labeled as discussing comments) comprises 21% Non-ignorable content* 42,296 25,300 41,572 to 33% of all sentences, comment dismissals comprise 4% to 5% Refined content** 33,595 18,561 32,331 of all sentences, and comment-based revisions comprise 2% to 3% Annotator Agreement (Kappa) of all sentences. 
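The evaluation protocol, choosing the classification cutoff that maximizes F1 on the training data and then reporting AUC, F1, precision, and recall at the sentence level, can be sketched as follows; function and variable names are illustrative.

import numpy as np
from sklearn.metrics import precision_recall_curve, precision_recall_fscore_support, roc_auc_score

def threshold_max_f1(y_true, scores):
    """Pick the decision cutoff that maximizes F1 (applied to the training data)."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return thresholds[np.argmax(f1[:-1])]  # the last P/R point has no associated threshold

def evaluate(y_true, scores, cutoff):
    y_pred = (scores >= cutoff).astype(int)
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
    return {"AUC": roc_auc_score(y_true, scores), "F1": f1, "Prec.": p, "Recall": r}

# cutoff = threshold_max_f1(y_train, train_scores)
# print(evaluate(y_test, test_scores, cutoff))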
Approximately half of all labeled sentences have Non-ignorable sentences* 0.42 0.52 0.67 been annotated by two individuals. Due to the annotator attrition, Non-ignorable sentences** 0.38 0.43 0.64 reliability annotations for a more refined labeling task (i.e., identifi- Neutral comment discussion 0.39 0.44 0.66 cation of comment dismissals and comment-based rule revisions) Comment dismissals 0.32 0.18 0.29 are available for 73% to 79% of all double-annotated sentences. Comment-based regulatory change 0.086 0.19 0.16 Table 1 also reports the inter-annotator agreement statistics, Multi-class 0.33 0.38 0.56 while Table 2 summarizes agreement with the expert annotator Notes: * Sentences for which double annotation of non-ignorable on four final rule documents used as part of the annotator train- content is available. ** Sentences for which double annotation of ing. (Expert annotations have been produced by the first author, content is also available. who has 10 years of professional experience in supporting EPA’s regulatory proposal development.) For the non-ignorable content, Table 2: Annotator Agreement* with Expert inter-annotator agreement scores range from 0.38 to 0.67 (depend- ing on the dataset), whereas agreement with the expert is 0.74 on average (range: 0.35–0.95). We note that agreement on this task ap- Kappa Mean Min Max pears to improve from the dev1 set to the test set, which may reflect Non-ignorable content 0.74 0.35 0.95 that the annotators learned to do the task better over time, given the Comment dismissals** 0.33 0 0.54 order in which the documents have been assigned. Inter-annotator Comment-based revisions** 0.38 0 0.75 agreement for the comment dismissal labeling task ranges from Notes: * Agreement is calculated at the sentence level for four final 0.18 to 0.32, while agreement on the comment-based rule revisions rule documents. A total of 4,105 sentences are available for this is very low, ranging between 0.086 and 0.19. Agreement with the evaluation. ** These statistics are calculated for the eight expert on these tasks is also low: 0.33 (range: 0–0.54) for the com- annotators who performed the task. ment dismissals and 0.38 (range: 0–0.75) for the comment-based rule revisions. We have reviewed the annotator errors vis-a-vis the expert an- notator. False negatives tend to occur most commonly when: passage requires complex inference. As such, the annotators tended to be conservative about assigning these labels for • The annotator captures only the initial part of the com- less obvious examples. ment discussion that contains typical lexical cues (e.g., “EPA received comments suggesting...”, “Commenters noted...”, For the false positives, we have observed the following tenden- “EPA agrees with the commenters...”) but fails to include cies: the entire—usually technical—comment discussion that can • EPA regulations are typically incremental, in that they often span multiple subsequent paragraphs; tend to modify older, preexisting rules. Therefore, the final • A passage with comment discussion is “buried” in the middle and proposal rule document discuss changes/ revisions of of a longer paragraph, as often happens when comments are the prior regulatory standard. 
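A sketch of the label aggregation described in Section 6.1 (a sentence receives a label if an annotator marked at least 80% of its tokens, and labels from multiple annotators are combined by union), together with the pairwise Cohen's kappa reported in Table 1; representing annotations as token-offset spans is an assumption of this sketch.

from sklearn.metrics import cohen_kappa_score

def sentence_label(sent_start, sent_end, annotated_spans, min_fraction=0.8):
    """1 if an annotator marked at least `min_fraction` of the sentence's tokens.
    `annotated_spans` is a list of (start, end) token offsets for one annotator."""
    covered = sum(max(0, min(end, sent_end) - max(start, sent_start))
                  for start, end in annotated_spans)
    return int(covered >= min_fraction * (sent_end - sent_start))

def union_label(labels_per_annotator):
    """A sentence is positive if at least one annotator labeled it."""
    return [int(any(column)) for column in zip(*labels_per_annotator)]

# Pairwise agreement on doubly annotated sentences, as reported in Table 1:
# kappa = cohen_kappa_score(labels_annotator_a, labels_annotator_b)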
This has been a significant discussed in the background section; source of confusion for the annotators, who found it diffi- • For the more difficult annotation task of identifying comment- cult to separate comment-based revisions of the proposed based rule revisions and comment dismissals, we have noted regulation from the revisions of the regulatory standard on that false negatives tend to occur when the evaluation of the the regulatory agenda, leading to false positives. Conference’17, July 2017, Washington, DC, USA Belova, Grabmair, and Nyberg • Another challenge for the annotators has been the decision Table 3: Baseline Test Set Results of when the discussion switches from comment-related to the general topics, also leading to false positives. Model AUC F1 Prec. Recall • Specifically for the comment-based rule revisions, some an- All Non-ignorable Content notators found it challenging to distinguish between revi- sions of the proposed rule that were based on comments Random 0.501 0.200 0.164 0.256 from revisions that occurred for other reasons. For example, CRF+HCF n.a 0.717 0.750 0.687 the EPA may implement revisions based on new evidence SVM+HCF 0.911 0.716 0.734 0.698 that emerges after the proposed rule is submitted for public SVM+BERT (as is) 0.921 0.695 0.721 0.672 SVM+HCF+BERT (as is) 0.915 0.689 0.753 0.636 review. SVM+BERT (tuned) 0.928 0.703 0.764 0.651 SVM+HCF+BERT (tuned) 0.913 0.693 0.709 0.677 6.2 Classification Results Comment Dismissals Table 3 and Table 4 show the test set evaluation performance results Random 0.502 0.020 0.017 0.023 for each binary classification task divided by learning framework Semi-Random 0.811 0.152 0.162 0.144 and feature set. The models have produced better than random pre- dictions, with largest AUC of 0.937 noted for the non-ignorable con- CRF+HCF n.a. 0.209 0.225 0.195 SVM+HCF 0.760 0.258 0.177 0.478 tent prediction and smallest AUC of 0.677 noted for the comment- SVM+BERT (as is) 0.869 0.277 0.194 0.484 based rule change prediction. These patterns largely reflect the SVM+HCF+BERT (as is) 0.862 0.258 0.170 0.537 differences in the quality of annotations obtained for our prediction SVM+BERT (tuned) 0.869 0.278 0.196 0.478 tasks, with the segmentation task being significantly easier than SVM+HCF+BERT (tuned) 0.768 0.257 0.191 0.393 the comment response classification task. 2-SVM+HCF 0.872 0.281 0.202 0.460 For the non-ignorable content prediction, the models produce 2-SVM+BERT (as is) 0.874 0.286 0.214 0.432 recall in the range of 0.636–0.708 and precision in the range of 2-SVM+HCF+BERT (as is) 0.881 0.257 0.214 0.322 0.688–0.798. Unsurprisingly, for the more complex annotation tasks 2-SVM+BERT (tuned) 0.888 0.318 0.249 0.441 with low annotator agreement, classification quality is poor. For the 2-SVM+HCF+BERT (tuned) 0.830 0.271 0.212 0.375 comment dismissal prediction, recall is 0.085–0.537 and precision is Comment-based Regulatory Change 0.091–0.249, whereas for the comment-based rule change prediction, Random 0.503 0.038 0.032 0.046 recall is 0.065–0.490 and precision is 0.056–0.189. Semi-Random 0.678 0.050 0.053 0.048 6.2.1 Linear Model Analysis. CRF model results do not appear to CRF+HCF n.a. 0.088 0.091 0.085 be materially different from those generated by the SVM model SVM+HCF 0.677 0.092 0.056 0.273 on the same handcrafted feature set, even through they take into SVM+BERT (as is) 0.802 0.126 0.074 0.420 account the labels of neighboring sentences. 
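The two-tiered models reported in Table 3 (2-SVM rows) can be sketched as follows: a first-tier classifier detects comment-discussion sentences, and a second-tier classifier, trained only on comment-discussion sentences, labels the detected ones as dismissals (or revisions). The SVM settings mirror footnote 9; the composition itself is our illustrative reading of the description in Section 4.

import numpy as np
from sklearn.svm import SVC

def fit_two_tier(X, y_discussion, y_dismissal):
    """Tier 1 detects comment discussion; tier 2 is trained only on discussion sentences."""
    tier1 = SVC(kernel="linear", C=1.0, max_iter=1500).fit(X, y_discussion)
    in_discussion = y_discussion == 1
    tier2 = SVC(kernel="linear", C=1.0, max_iter=1500).fit(X[in_discussion], y_dismissal[in_discussion])
    return tier1, tier2

def predict_two_tier(tier1, tier2, X):
    flagged = tier1.predict(X) == 1
    labels = np.zeros(len(X), dtype=int)
    if flagged.any():
        labels[flagged] = tier2.predict(X[flagged])
    return labels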
We note, however, SVM+HCF+BERT (as is) 0.736 0.099 0.058 0.335 SVM+BERT (tuned) 0.815 0.125 0.077 0.337 that the CRF models have produced consistently higher precision SVM+HCF+BERT (tuned) 0.754 0.091 0.051 0.446 scores, compared to the SVM models estimated on the same feature 2-SVM+HCF 0.724 0.081 0.091 0.073 set. Because we experienced some convergence problems with CRF 2-SVM+BERT (as is) 0.796 0.104 0.112 0.097 models, we have fit them to only one feature set. 2-SVM+HCF+BERT (as is) 0.745 0.078 0.075 0.081 Table 3 also shows that neural BERT features on average tend to 2-SVM+BERT (tuned) 0.808 0.086 0.128 0.065 generate higher AUC, precision, and recall. We note that the two- 2-SVM+HCF+BERT (tuned) 0.744 0.108 0.088 0.138 tiered models perform better for the comment dismissal prediction, Notes: Random – predictions are draws from a Bernoulli distribution with probability but not for the comment-based revision prediction. In the latter case, set to the target class prior. Semi-Random – predictions are generated by first applying the gains in precision are minor and do not offset the significant the best-performing non-ignorable content classifier and then drawing from a losses in recall. Bernoulli distribution with probability set to the target class conditional prior. 2-SVM We also observe that neural features based on the fine-tuned – a two-tiered SVM model. HCF – hand crafted features. AUC – area under the ROC BERT can perform better than those using out-of-the-box BERT curve. CRF model does not produce confidence scores, hence AUC estimation was not (e.g. best AUC and precision on non-ignorable content prediction). possible. The classification cutoff was chosen to maximize F1 score for each model. Interestingly, combining neural and handcrafted feature sets gener- ally does not produce synergy performance increases, which could obtained on for the MLP with an identity transformation (MLP-Id) be due to the substantial increase in the overall feature dimension, before the final softmax.15 or the lack of feature interaction capacity in linear models. We observe that nonlinear models using BERT features can achieve somewhat higher AUC and F1 scores than the linear models 6.2.2 Multi-Layer Perceptron Results. In a second set of experi- shown in Table 3. We also see that adding handcrafted features to ments we assessed whether classification performance increases with models that allow for feature interactions. To this end, we 15 We have also obtained results for the MLP with an a Rectified Linear Unit (ReLU) trained a series of Multi-Layer-Perceptron models (i.e. a neural net- activation function before the final softmax MLP-ReLU. The practical difference is that a ReLU activation will truncate all incoming negative activation values to 0 and leave work with one hidden layer of size 100 and a two-class softmaxed positive ones unchanged. We do not report these results because they were largely output) on our tasks and feature sets. Table 4 contains the results we inferior to those obtained for the MLP-Id variant. Segmentation of Rulemaking Documents Conference’17, July 2017, Washington, DC, USA a model can occasionally yield some performance synergy. 
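The MLP variants evaluated in Table 4 correspond to the scikit-learn configuration described in footnotes 12 and 15, roughly as follows; the fitting and scoring lines are illustrative.

from sklearn.neural_network import MLPClassifier

# MLP-Id: one hidden layer of 100 units with an identity activation before the
# output layer, trained for at most 100 epochs at otherwise default settings.
mlp_id = MLPClassifier(hidden_layer_sizes=(100,), activation="identity", max_iter=100)

# MLP-ReLU (footnote 15) differs only in the hidden-layer activation.
mlp_relu = MLPClassifier(hidden_layer_sizes=(100,), activation="relu", max_iter=100)

# mlp_id.fit(X_train, y_train)
# scores = mlp_id.predict_proba(X_test)[:, 1]  # used for AUC and cutoff selection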
From False Positives: The models tend to produce false positives when this we infer that nonlinear models could potentially produce bet- sentences contain certain trigger words (such as “response”, “re- ter results on our dataset, and hence we plan to experiment with vision”, “finalizing the rule as proposed”) yet the overall context recurrent or dilated convolutional models for sequence tagging to of the passage is not related to the discussion of public comments. leverage the document context in future work. For example, these trigger words have been observed in passages discussing petitions and revisions of the regulatory standard that are not based on comments, similar to mistakes made by human Table 4: Auxiliary Test Set Results annotators. There is also a fair share of label noise: As noted earlier, the annotators have been challenged by longer comment discus- Model AUC F1 Prec. Recall sions and occasionally failed to capture the entire relevant span. All Non-ignorable Content We also conjecture that in this case the models have been guided by the section-header related signal. Random 0.501 0.200 0.164 0.256 MLP-Id+HCF 0.911 0.711 0.776 0.656 False Negatives: The false negatives tend to occur in sections that MLP-Id+BERT (as is) 0.917 0.678 0.688 0.669 do not commonly contain comment discussion (e.g., “Background”, MLP-Id+HCF+BERT (as is) 0.930 0.731 0.798 0.674 “Executive Order Review”). Sentences that lack the boilerplate lan- MLP-Id+BERT (tuned) 0.937 0.705 0.772 0.648 guage (e.g., “response”, “EPA”, “comment”) also tend to be missed MLP-Id+HCF+BERT (tuned) 0.930 0.732 0.759 0.708 more often. As with the false positives, we observed some amount Comment Dismissals of label noise, often in cases when the annotators mislabeled discus- Random 0.502 0.020 0.017 0.023 sions of regulatory revisions that have not been driven by public Semi-Random 0.811 0.152 0.162 0.144 feedback or when annotators have failed to determine an appropri- ate boundaries for the technical discussion of comments. MLP-Id+HCF 0.851 0.289 0.208 0.476 MLP-Id+BERT (as is) 0.875 0.273 0.204 0.410 Label Confusion: We have observed several cases of the models MLP-Id+HCF+BERT (as is) 0.871 0.297 0.213 0.492 being confused about the polarity of EPA assessment, particularly MLP-Id+BERT (tuned) 0.893 0.284 0.209 0.442 when the sentence has included trigger words such as “agree” and MLP-Id+HCF+BERT (tuned) 0.825 0.291 0.212 0.460 “disagree” together. 2-MLP-Id+HCF 0.850 0.284 0.229 0.374 2-MLP-Id+BERT (as is) 0.859 0.281 0.218 0.394 Parsing: We have noted several instances of erroneous sentence 2-MLP-Id+HCF+BERT (as is) 0.887 0.301 0.243 0.395 parsing (e.g., a citation “40 CFR 51.1010(b).” has been isolated as a 2-MLP-Id+BERT (tuned) 0.890 0.309 0.240 0.432 sentence) that lead to classification errors. This issue could be reme- 2-MLP-Id+HCF+BERT (tuned) 0.882 0.294 0.239 0.383 died by a sentence boundary detector oriented towards processing Comment-based Regulatory Change legal text [54]. Random 0.503 0.038 0.032 0.046 Semi-Random 0.678 0.050 0.053 0.048 7 DISCUSSION MLP-Id+HCF 0.718 0.103 0.061 0.329 It is likely possible to automatically identify certain type of con- MLP-Id+BERT (as is) 0.818 0.121 0.072 0.384 tent in regulatory documents with irregular structure. 
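The citation-fragment parsing errors noted above could be mitigated either by a legal-domain sentence boundary detector [54] or by a lightweight post-processing heuristic such as the illustrative sketch below; the regular expression is our assumption and not part of the authors' pipeline.

import re

# Matches fragments that consist only of a CFR/FR-style citation, e.g. "40 CFR 51.1010(b)."
CITATION_ONLY = re.compile(r"^\s*\d+\s+(CFR|FR)\s+[\d.()a-z-]+\.?\s*$", re.IGNORECASE)

def merge_citation_fragments(sentences):
    """Reattach citation-only 'sentences' to the preceding sentence."""
    merged = []
    for sentence in sentences:
        if merged and CITATION_ONLY.match(sentence):
            merged[-1] = merged[-1].rstrip() + " " + sentence.strip()
        else:
            merged.append(sentence)
    return merged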
Our baseline MLP-Id+HCF+BERT (as is) 0.766 0.113 0.068 0.335 segmentation performance for detecting comment discussion sen- MLP-Id+BERT (tuned) 0.837 0.138 0.086 0.343 tences with recall in the range of 0.636–0.708 and precision in the MLP-Id+HCF+BERT (tuned) 0.806 0.114 0.065 0.490 range of 0.688–0.798. While we have focused on identifying com- 2-MLP-Id+HCF 0.757 0.078 0.093 0.068 ment discussion by the receiving agency, we believe that there are 2-MLP-Id+BERT (as is) 0.723 0.092 0.077 0.114 other types of content (e.g., regulatory requirements) automated 2-MLP-Id+HCF+BERT (as is) 0.770 0.092 0.112 0.078 segmentation of which may be both, desired and feasible. Detect- 2-MLP-Id+BERT (tuned) 0.766 0.123 0.189 0.091 ing specific comment discussions that either dismiss comments 2-MLP-Id+HCF+BERT (tuned) 0.789 0.130 0.113 0.154 or announce rule revision turns out to be a harder task for both Notes: Random – predictions are draws from a Bernoulli distribution with probability annotators and, consequently, for models. Moving forward, this set to the target class prior. Semi-Random – predictions are generated by first applying the best-performing non-ignorable content classifier and then drawing from begs the question of which information need the model caters to. If a Bernoulli distribution with probability set to the target class conditional prior. value is added by quickly pointing an expert to comment discussion 2-MLP – a two-tiered MLP model. HCF – hand crafted features. AUC – area under the passages, then a well-performing model is within reach given good ROC curve. MLP-Id – a multi-layer perceptron with one hidden layer with 100 units training data. On the other hand, an automated analysis of topics and an identity non-linearity followed by a Softmax; this model is equivalent to a for which comments have been influential remains a hard problem. generalized linear regression model with interaction terms. The classification cutoff We also note that our dataset has been compiled using highly was chosen to maximize F1 score for each model. educated non-expert annotators. We have found that this type of background is sufficient for producing relatively coarse annota- tions (e.g., identifying parts of the document that contain comment 6.2.3 Error Analysis. For our best-performing models we have discussion). We have measured the annotator-expert agreement generated and examined five random examples for each type of of 0.74 for the comment discussion identification task. However, error. Our findings are as follows: more refined annotation tasks, such as the ones determining the Conference’17, July 2017, Washington, DC, USA Belova, Grabmair, and Nyberg agency’s responses to public feedback, would likely require expert- [15] Catherine Dumas, Teresa M Harrison, Loni Hagen, and Xiaoyi Zhao. 2017. What level understanding of the domain. Do the People Think?: E-Petitioning and Policy Decision Making. In Beyond Bureaucracy. Springer, 187–207. We believe that our baseline modeling results can be further [16] Yixing Fan, Jiafeng Guo, Yanyan Lan, Jun Xu, Chengxiang Zhai, and Xueqi improved by developing a fully neural sequence tagging model, Cheng. 2018. Modeling diverse relevance patterns in ad-hoc retrieval. In The 41st International ACM SIGIR Conference on Research & Development in Information such as the one developed for the standard discourse segmentation Retrieval. ACM, 375–384. corpus [56]. However, even with access to the sequence encoders [17] Vanessa Wei Feng and Graeme Hirst. 2014. 
Two-pass discourse segmentation such as BERT, the limited size of our corpus may still present a with pairing and global features. arXiv preprint arXiv:1407.8215 (2014). [18] Elisa Ferracane, Titan Page, Junyi Jessy Li, and Katrin Erk. 2019. From News to modeling challenge. Medical: Cross-domain Discourse Segmentation. arXiv preprint arXiv:1904.06682 (2019). [19] Loni Hagen, Teresa M Harrison, and Catherine L Dumas. 2018. Data Analytics 8 CONCLUSIONS for Policy Informatics: The Case of E-Petitioning. In Policy Analytics, Modelling, We have produced a dataset and baseline for a novel discourse and Informatics. Springer, 205–224. [20] Loni Hagen, Teresa M Harrison, Özlem Uzuner, Tim Fake, Dan Lamanna, and segmentation task of identifying public comment discussion and Christopher Kotfila. 2015. Introducing textual analysis tools for policy informat- evaluation by regulatory agencies. In doing so we presented ev- ics: a case study of e-petitions. In Proceedings of the 16th annual international idence that detecting comment discussions automatically using conference on digital government research. ACM, 10–19. [21] Loni Hagen, Özlem Uzuner, Christopher Kotfila, Teresa M Harrison, and Dan mainstream NLP techniques is feasible given good training data. Lamanna. 2015. Understanding Citizens’ Direct Policy Suggestions to the Federal Classifying discussions of a particular type is harder both because Government: A Natural Language Processing and Topic Modeling Approach. In System Sciences (HICSS), 2015 48th Hawaii International Conference on. IEEE, of data sparsity and low annotator agreement. While good general 2134–2143. detection performance will add value in some practical settings, [22] Mehedi Hasan, A Kotov, S Naar, GL Alexander, and A Idalski Carcone. 2019. we see opportunity for further improvement in the use of neural Deep neural architectures for discourse segmentation in e-mail based behavioral interventions. In American Medical Informatics Association (AMIA). sequence tagging models, albeit subject to the limitations of data [23] Geoffrey E Hinton. 1990. Connectionist learning procedures. In Machine learning. quality as a function of annotator expertise, training, and type Elsevier, 555–610. system design. [24] Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language under- standing with Bloom embeddings, convolutional neural networks and incremen- tal parsing. To appear (2017). 9 ACKNOWLEDGMENTS [25] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015). The authors thank University of Pittsburgh Intelligent Systems [26] Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level Program student Jaromir Savelka for permission to use the Gloss discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 13–24. annotation tool. [27] Robin Jia, Cliff Wong, and Hoifung Poon. 2019. Document-Level N -ary Re- lation Extraction with Multiscale Representation Learning. arXiv preprint arXiv:1904.02347 (2019). REFERENCES [28] Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu. 2009. Cutting-plane [1] Jaime Arguello and Jamie Callan. 2007. A bootstrapping approach for identifying training of structural SVMs. Machine Learning 77, 1 (2009), 27–59. stakeholders in public-comment corpora. 
In Proceedings of the 8th annual interna- [29] Barbara Konat, John Lawrence, Joonsuk Park, Katarzyna Budzynska, and Chris tional conference on Digital government research: bridging disciplines & domains. Reed. 2016. A Corpus of Argument Networks: Using Graph Properties to Analyse Digital Government Society of North America, 92–101. Divisive Issues.. In LREC. [2] Jaime Arguello, Jamie Callan, and Stuart Shulman. 2008. Recognizing citations in [30] Namhee Kwon, Stuart W Shulman, and Eduard Hovy. 2006. Multidimensional public comments. Journal of Information Technology & Politics 5, 1 (2008), 49–71. text analysis for eRulemaking. In Proceedings of the 2006 international conference [3] Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. 2015. Better document-level on Digital government research. Digital Government Society of North America, sentiment analysis from rst discourse parsing. arXiv preprint arXiv:1509.01599 157–166. (2015). [31] Namhee Kwon, Liang Zhou, Eduard Hovy, and Stuart W Shulman. 2007. Identify- [4] Mohammad Hadi Bokaei, Hossein Sameti, and Yang Liu. 2016. Extractive sum- ing and classifying subjective claims. In Proceedings of the 8th annual international marization of multi-party meetings through discourse segmentation. Natural conference on Digital government research: bridging disciplines & domains. Digital Language Engineering 22, 1 (2016), 41–72. Government Society of North America, 76–81. [5] Claire Cardie, Cynthia R Farina, Matt Rawding, and Adil Aijaz. 2008. An erule- [32] John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional making corpus: Identifying substantive issues in public comments. (2008). random fields: Probabilistic models for segmenting and labeling sequence data. [6] Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2003. Building a (2001). discourse-tagged corpus in the framework of rhetorical structure theory. In [33] Gloria T Lau. 2004. A comparative analysis framework for semi-structured docu- Current and new directions in discourse and dialogue. Springer, 85–112. ments, with applications to government regulations. Stanford University. [7] Nuno Carvalho and Rui Pedro Lourenço. 2018. E-Rulemaking: Lessons from the [34] John Lawrence, Joonsuk Park, Katarzyna Budzynska, Claire Cardie, Barbara Literature. International Journal of Technology and Human Interaction (IJTHI) 14, Konat, and Chris Reed. 2017. Using argumentative structure to interpret debates 2 (2018), 35–53. in online deliberative democracy and eRulemaking. ACM Transactions on Internet [8] Lijun Chen. 2007. Summaritive digest for large document repositories with Technology (TOIT) 17, 3 (2017), 25. application to e-rulemaking. (2007). [35] Karen EC Levy and Michael Franklin. 2014. Driving regulation: using topic [9] Cary Coglianese. 2004. E-Rulemaking: Information technology and the regulatory models to examine political contention in the US trucking industry. Social Science process. Administrative Law Review (2004), 353–402. Computer Review 32, 2 (2014), 182–194. [10] Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational [36] Junyi Jessy Li, Kapil Thadani, and Amanda Stent. 2016. The role of discourse and psychological measurement 20, 1 (1960), 37–46. units in near-extractive summarization. In Proceedings of the 17th Annual Meeting [11] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine of the Special Interest Group on Discourse and Dialogue. 137–147. learning 20, 3 (1995), 273–297. 
[37] Michael A Livermore, Vladimir Eidelman, and Brian Grom. 2017. Computationally [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: assisted regulatory participation. Notre Dame L. Rev. 93 (2017), 977. Pre-training of Deep Bidirectional Transformers for Language Understanding. [38] Daniel Marcu. 2000. The theory and practice of discourse parsing and summariza- CoRR abs/1810.04805 (2018). arXiv:1810.04805 http://arxiv.org/abs/1810.04805 tion. MIT press. [13] Tao Ding and Shimei Pan. 2016. How Reliable Is Sentiment Analysis? A Multi- [39] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. domain Empirical Investigation. In International Conference on Web Information Distributed representations of words and phrases and their compositionality. In Systems and Technologies. Springer, 37–57. Advances in neural information processing systems. 3111–3119. [14] Lauren M Dinour and Antoinette Pole. 2017. Potato Chips, Cookies, and Candy [40] John E. Moody. 1988. Fast Learning in Multi-Resolution Hierarchies. Oh My! Public Commentary on Proposed Rules Regulating Competitive Foods. In Advances in Neural Information Processing Systems 1, [NIPS Confer- Health Education & Behavior 44, 6 (2017), 867–875. ence, Denver, Colorado, USA, 1988]. 29–39. http://papers.nips.cc/paper/ Segmentation of Rulemaking Documents Conference’17, July 2017, Washington, DC, USA 175-fast-learning-in-multi-resolution-hierarchies [41] Peter Muhlberger, Nick Webb, and Jennifer Stromer-Galley. 2008. The Deliberative E-Rulemaking project (DeER): improving federal agency rulemaking via natural language processing and citizen dialogue. In Proceedings of the 2008 international conference on Digital government research. Digital Government Society of North America, 403–404. [42] Andreas C. Müller and Sven Behnke. 2014. pystruct - Learning Structured Prediction in Python. Journal of Machine Learning Research 15 (2014), 2055– 2060. http://jmlr.org/papers/v15/mueller14a.html [43] Joonsuk Park. 2016. Mining and evaluating argumentative structures in user comments in eRulemaking. Cornell University. [44] Joonsuk Park, Cheryl Blake, and Claire Cardie. 2015. Toward machine-assisted participation in eRulemaking: An argumentation model of evaluability. In Pro- ceedings of the 15th International Conference on Artificial Intelligence and Law. ACM, 206–210. [45] Joonsuk Park and Claire Cardie. 2014. Identifying appropriate support for propo- sitions in online user comments. In Proceedings of the First Workshop on Argu- mentation Mining. 29–38. [46] Joonsuk Park, Sally Klingel, Claire Cardie, Mary Newhart, Cynthia Farina, and Joan-Josep Vallbé. 2012. Facilitative moderation for online participation in eRule- making. In Proceedings of the 13th Annual International Conference on Digital Government Research. ACM, 173–182. [47] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W. [48] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cour- napeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830. [49] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. 
Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018). [50] Rachel A Potter. 2017. More than spam? Lobbying the EPA through public comment campaigns. In Brookings Series on Regula- tory Process and Perspective. https://www.brookings.edu/research/ more-than-spam-lobbying-the-epa-through-public-comment-campaigns [51] Stephen Purpura, Claire Cardie, and Jesse Simons. 2008. Active learning for e-rulemaking: Public comment categorization. In Proceedings of the 2008 interna- tional conference on Digital government research. Digital Government Society of North America, 234–243. [52] Reza Rajabiun. 2015. Beyond Transparency: The Semantics of Rulemaking for an Open Internet. Ind. LJ Supp. 91 (2015), 33. [53] Reza Rajabiun and Catherine Middleton. 2015. Public Interest in the Regulation of Competition: Evidence from Wholesale Internet Access Consultations in Canada. Journal of Information Policy 5 (2015), 32–66. [54] Jaromir Savelka, Vern R Walker, Matthias Grabmair, and Kevin D Ashley. 2017. Sentence boundary detection in adjudicatory decisions in the united states. Traite- ment automatique des langues 58, 2 (2017), 21–45. [55] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008. [56] Yizhong Wang, Sujian Li, and Jingfeng Yang. 2018. Toward Fast and Accurate Neural Discourse Segmentation. arXiv preprint arXiv:1808.09147 (2018). [57] Antje Witting. 2015. Measuring the use of knowledge in policy development. Central European Journal of Public Policy 9, 2 (2015), 54–62. [58] Hui Yang and Jamie Callan. 2005. Near-duplicate detection for eRulemaking. In Proceedings of the 2005 national conference on Digital government research. Digital Government Society of North America, 78–86. [59] Hui Yang and Jamie Callan. 2008. Ontology generation for large email collections. In Proceedings of the 2008 international conference on Digital government research. Digital Government Society of North America, 254–261. [60] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceed- ings of the IEEE international conference on computer vision. 19–27.