Segmentation of Rulemaking Documents for Public Notice-and-Comment Process Analysis Anna Belova∗ Matthias Grabmair Eric Nyberg abelova@alumni.cmu.edu mgrabmai@andrew.cmu.edu ehn@cs.cmu.edu Carnegie Mellon University Carnegie Mellon University Carnegie Mellon University Pittsburgh, PA Pittsburgh, PA Pittsburgh, PA ABSTRACT The online forum for the US public notice-and-comment process— We evaluate feasibility of automated identification of comment regulations.gov—was launched in January 2003, as part of the US discussion passages and comment-driven proposed rule revisions eRulemaking program established as a cross-agency E-Gov initia- in the US Environmental Protection Agency’s (EPA’s) rulemaking tive under Section 206 of the 2002 E-Government Act (H.R. 2458/S. documents. We have annotated a dataset of final rule documents to 803). In this collection, all documents pertaining to the develop- identify all spans in which EPA discusses and evaluates the merits ment of a particular rule are compiled in a regulatory docket. A of public comments received on its proposed rules, and present typical docket contains a proposed rule document, many public lessons learned from the annotation process. We implement several comment documents, and a final/revised rule document.1 As such, baseline supervised discourse segmentation models that combine regulations.gov provides a testbed for study of the public notice- classic linear learners with sentence representations using hand- and-comment discourse in the US. crafted features as well as Bidirectional Encoder Representations In this work, we focus on (1) identifying spans in the final rule from Transformers (BERT). We observe good agreement on anno- documents that contain the agency’s discussion of the public com- tation comment discussions and our models achieve a classification ments it received, and (2) classifying those spans as being either F1 of 0.73. Public comment dismissals and rule revisions are substan- dismissals of the commenter claims or revisions of the proposed tially harder to annotate and predict, leading to lower agreement regulations prompted by the comment. In that, we analyze 353 US and model performance. Our work contributes a dataset and a base- Environmental Protection Agency (EPA) regulations proposed in line for a novel discourse segmentation task of identifying public January 2003 or later, and finalized as of March 2018.2 comment discussion and evaluation by the receiving agency. Our work contributes a dataset3 and a baseline for a novel dis- course segmentation task of identifying public comment discussion and evaluation by the receiving agency. Automatic detection of 1 INTRODUCTION comment discussion passages in the rulemaking documents could Government agencies are created by the legislatures worldwide to improve the efficiency of regulatory review conducted by experts at regulate social, economic, and political aspects of people’s lives. a number of organizations, including the US Office of Information These agencies belong to the executive branch of the government, and Regulatory Affairs, regulatory agencies, and other stakeholders yet they create legally enforceable regulations and rules that im- of the regulatory process. In addition, segmentation of regulatory plement broad legislation. 
In the US, public notice-and-comment discourse is the first step bringing agency’s narrative deliberations processes have become an important venue for influencing social in the study of bureaucratic politics and decision making (e.g., reg- and economic policy. In that, US agencies publish proposed rules ulatory capture theory) by economists and political scientists [37], in the Federal Register (FR) and all interested parties are given which to date has relied on structured data generated by surveys an opportunity to comment. Agency regulatory proposals receive and administrative record-keeping (e.g. permitting, inspections). public feedback from individuals, businesses, organized groups (of individuals or businesses), and other agencies. Comments represent 2 RELATED WORK heterogeneous interests in particular regulatory outcomes. The In the peer-reviewed literature, discussion of e-rulemaking benefits, agency is not obliged to react to each individual received comment. challenges, and related artificial intelligence (AI) methods began However, it has to respond to comments that raise significant issues in the early 2000s [9]. Over a decade later, surveys by [7] and [37] with the proposed rule and, if the points raised have merit, may describe several e-rulemaking initiatives that involved successful substantively revise of the rulemaking document. The final rule applications of AI. One line of e-rulemaking research has focused document is published in the FR and contains the discussion of on tasks relevant to management of massive amount of public com- submitted comments, or points to other documents in the docket ments received by agencies (e.g., [58], [31], [52]). Another line of that address concerns raised in the comments. ∗ Corresponding author 1 Other documents, such as transcripts of public hearings, technical support documents, detailed comment response documents, copies of pertinent scientific papers, e-mails and other correspondence, may also be included. Finally, a docket may also contain In: Proceedings of the Workshop on Artificial Intelligence and the Administrative State tabular data and software source code used to produce analytical results. (AIAS 2019), June 17, 2019, Montreal, QC, Canada. 2 We have chosen to focus on EPA because this agency published the most rules (∼ 20% © 2019 Copyright for this paper by its authors. Use permitted under Creative Commons of all rule documents) and received the most comment submissions (∼ 10% of all License Attribution 4.0 International (CC BY 4.0). comment documents) in regulations.gov during the studied time period. 3 The data and code are available at https://github.com/mug31416/PubAdmin- Published at http://ceur-ws.org Discourse.git Conference’17, July 2017, Washington, DC, USA Belova, Grabmair, and Nyberg research, conducted as part of Cornell University’s RegulationRoom 3 DATA project, has focused on tools to improve the quality of public dis- 3.1 Rule-Making Documents course around rulemaking (e.g., [41], [46]). Research on the text of rules developed by agencies has mostly focused on the search We work with the EPA’s final rule documents that are part of the for similar rules in the FR [33], rather than segmentation of the FR. Along with a summary, each of our documents can contain one comment-related discourse in the rule documents. 
or more of the following sections: regulatory background, scope of Prior to launch of regulations.gov, work on e-rulemaking the regulation, rationale for action, technical material describing used several rule-specific comment collections that were either the regulatory requirements, responses to public comments on the shared by the agencies—EPA, Fish and Wildlife Service (FWS)—or proposed regulation, statutory and executive order review, and gathered as part of the RegulationRoom experiments in collabora- legal references. We are interested in automated identification of all tion with the US Department of Transportation (DOT). The tasks passages where the agency discusses public comments, which could have included near duplicate detection to address mass comment occur throughout the document and are not necessarily confined campaigns [58], comment topic modeling [5, 8, 30, 51, 59], stake- to the comment response section. holder attitude identification [1, 31], and presence of substantive We note that the structure of the final rule documents can vary points in public comments [2, 44, 45, 57]. The RegulationRoom significantly depending on whether it has been produced by the project has generated a number or papers on argument mining EPA headquarters or a regional office, as well as depending on the and conflict detection within comments [29, 34, 43]. These research specific EPA office (e.g., Office of Water, Office of Air and Radiation). efforts have focused on examining only a few regulatory proceed- For example, rule documents produced by the headquarters offices ings at a time, whereas we evaluate a signifcantly larger dataset are usually major federal regulations that tend to be long and receive containing hundreds of rule documents. significant public feedback. On the other hand, rule documents More recent work on e-regulation has analyzed public comment produced by regional offices tend to be shorter.4 data collected by regulations.gov [13, 14, 35, 37, 50, 52], rule- It should be noted that our dataset only contains final rule docu- specific data from the Canadian government [53], and data from the ments as published in the FR. It does not include submitted comment White House e-petition platform [15, 19–21]. The tasks addressed documents, technical support documents, or detailed, dedicated in this body of work are topic modeling [15, 20, 21, 35, 37, 52, 53], comment response documents that are part of the docket but extra- sentiment analysis [13, 14, 37, 50], named entity recognition [20], neous to the register. and social network analysis [19]. 3.1.1 Task 1: Detecting Comment Discussions. In the first task, we Segmentation of text into discourse units [38] is a core natural want to identify the spans in the document where the EPA discusses language task. Many downstream tasks, such as information ex- submitted public comments. Examples of a comment discussion traction [27], sentiment analysis [3], information retrieval [16], and include: summarization [4, 36], can benefit from discourse segmentation. • Descriptions of comments received by the agency. For exam- Because lexical and syntactic text properties form important dis- ple, “EPA received comments suggesting that the definition course clues [6], many segmentation methods rely on hand-crafted of clean alternative fuel conversion should be limited to a features to capture them [17, 26]. 
Classic learning frameworks that group of fuels with proven emission benefits.”; have been used for discourse segmentation are linear Support Vec- • Descriptions of the agency’s responses to the comments it tor Machines (SVM) [11] and linear-chain Conditional Random receives. For example, “ EPA believes however that the public Fields (CRF) [32]. interest is better served by a broader definition that allows for One of the key challenges in discourse segmentation develop- future introduction of innovative and as-yet unknown fuel ment is the dearth of annotated data, which, until recently, pre- conversion systems. EPA is therefore finalizing the proposed vented the use of neural architectures. Effective neural discourse definition of clean alternative fuel conversion...”. segmentation methods [22, 56] have relied on word representations obtained from an external neural model trained to perform a re- By distinction, we are not interested in: lated task using a large corpus [39, 49]. The state-of-the art neural • Summarized feedback from petitions (as opposed to public discourse segmentation framework [18, 56] has employed a Bidirec- comments) to the agency; tional Long-Short-Term Memory-CRF architecture (BiLSTM-CRF) • Descriptions of the public comments on another rule; [25] with an attention mechanism [55]. • Statements such as “we received no comments”; For our baseline model development, we have combined several • Passages discussing revisions of a regulatory standard rather classic learning methods with hand-crafted, as well as neural sen- than revisions of the proposed rule; tence representations, from Bidirectional Encoder Representations • Referrals to another document in the docket with detailed from Transformers (BERT) [12], which were trained on English responses to comments. Wikipedia (2,500 million words) and BooksCorpus (800 million 3.1.2 Task 2: Classification of Comment Merit. In the second task, words) [60] using masked language and next sentence prediction we want to classify each comment discussion span as to whether objectives. BERT representations have demonstrated to perform the discussed comment prompted a change in the final rule from well on a wide range of natural language processing tasks. We also the proposed rule. As such, we are considering three categories: explore whether fine-tuning of BERT on the unlabeled documents in our corpus improves performance. 4 With the possible exception of the regional air quality rules that still tend to attract considerable public attention Segmentation of Rulemaking Documents Conference’17, July 2017, Washington, DC, USA passages in which the agency indicates a revision of the rule based Without context, it is unclear whether this sentence has anything on a public comment, passages in which the agency dismisses to do with comments at all, let alone whether required vs. optional a comment, and neutral comment discussion passages (i.e., the compliance results in it agreeing with, or dismissing, the comment’s passages in which the agency neither dismisses the comment nor arguments. indicates a revision). Examples of formulations reflecting comment-based regulatory 3.2 Acquisition and Sampling change are rule revisions and rule withdrawals: We have created our corpus from regulations.gov data by se- • “To address concerns about space limitations, EPA will allow lecting EPA regulatory dockets for rules proposed in January 2003 the label information to be logically split between two labels or later and finalized as of March 2018. 
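As an illustration of how fixed BERT sentence representations of the kind used here can be obtained, the following Python sketch uses the pretrained bert-base-uncased model via the pytorch-pretrained-BERT package cited later in footnote 14; the truncation to 512 WordPieces and the mean-pooling of the final hidden layer are assumptions made for exposition, since the paper does not prescribe a pooling strategy.

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def embed_sentence(text):
    # WordPiece-tokenize, add the special markers, and truncate to BERT's 512-token limit.
    tokens = ["[CLS]"] + tokenizer.tokenize(text)[:510] + ["[SEP]"]
    ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
    with torch.no_grad():
        encoded_layers, _pooled = bert(ids)  # list of hidden layers plus the pooled [CLS] vector
    # Mean-pool the last hidden layer into one fixed-size sentence vector
    # (this pooling choice is an assumption of the sketch, not the authors' stated method).
    return encoded_layers[-1].squeeze(0).mean(dim=0).numpy()

vector = embed_sentence("EPA received comments suggesting that the definition be narrowed.")

As in the paper, the vector of a sentence and that of its preceding sentence can then be concatenated to form the final feature vector.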
Our selection has been that are both placed as close as possible to the original Vehicle constrained to dockets containing at least one proposed rule docu- Emission Control Information (VECI) or engine label.” ment, at least one final rule document, and at least one comment • “EPA agrees and is including use of this procedure in the OBD document. Our corpus contains 1,566 EPA dockets (meta-data 8.8 demonstration requirement for intermediate age vehicles.” MB), 2,645 final rule documents (HTML, 376 MB), 2,531 proposed • “The EPA has reviewed the new data submitted by the com- rule documents (HTML, 400 MB), and 282,655 comment documents menter and used these data to determine the revised MACT (85% PDF, 36 GB; 15% plain text, 836 MB). floor for continuous process vents at existing sources.” For the purposes of exhaustive rule document annotation, we • “EPA received one adverse comment from a single Com- have used stratified random sampling at the docket level to select menter on the aforementioned rule. As a result of the com- two development docket sets (dev1 and dev2) and one test docket ment received, EPA is withdrawing the direct final rule ap- set. The sampling procedure has ensured that the docket sets are a proving the aforementioned changes to the Alabama SIPs.” representative mix of EPA program offices and regions.5 As such, Examples of comment dismissals without a subsequent regula- we have obtained 75 dev1-set dockets (116 documents), 76 dev2-set tory change are: dockets (136 documents), and 73 test-set dockets (99 documents). In our qualitative examination of the regulatory documents, we • “We disagree that our action to approve California’s mobile have found that the section headers of the rule documents are source regulations that have been waived or authorized by often informative about whether a section contains a discussion of the EPA under CAA section 209 is inconsistent with the public comments. To make use of this additional information, we Ninth Circuit’s decision...” have applied the same random sampling procedure to the remaining • “EPA is finalizing the conversion manufacturer definition as dockets to obtain 211 training dockets (817 training documents) and proposed.” 103 validation dockets (197 validation documents) for the section • “While we agree with the commenter that pressure release header annotation. from a PRD constitutes a violation, we will address this in a separate rulemaking...” • “In the final rule we will clarify our position...” 3.3 Preprocessing • “EPA appreciates support from the commenters for this ini- The rule documents were processed in two steps. First, we have tiative and agrees that the rule makes it possible for EPA to applied a rule-based rule document parsing procedure to delete process the TRI data more quickly.” tables, split the text into sections, and retrieve section titles of • “EPA believes that no further response to the comment is the first and second level super-sections. This procedure exploits necessary...” the regular structure of documents to create heuristics applicable We observe that this task requires considerably more complex to roughly 90% of documents.6 When exceptions to the standard inference, potentially spanning multiple sections of the document. structure are detected, we manually fixed irregularities to enable As seen in the examples above, comment dismissals range from automatic parsing. Second, the section text has been split into very obvious to rather subtle. 
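A minimal sketch of the docket-level stratified sampling described above is given below; the stratum labels (program office/region combinations), the dictionary schema, and the equal split fractions are illustrative assumptions rather than the authors' actual procedure.

import random
from collections import defaultdict

def split_dockets(dockets, fractions=(1 / 3, 1 / 3, 1 / 3), seed=0):
    """Docket-level stratified split into dev1/dev2/test sets.

    `dockets` is a list of dicts with hypothetical keys 'docket_id' and 'stratum'
    (e.g. an office/region label such as 'OW/HQ' or 'OAR/Region 1').
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for d in dockets:
        by_stratum[d["stratum"]].append(d["docket_id"])
    splits = {"dev1": [], "dev2": [], "test": []}
    for ids in by_stratum.values():
        rng.shuffle(ids)
        n = len(ids)
        c1 = round(n * fractions[0])
        c2 = c1 + round(n * fractions[1])
        splits["dev1"].extend(ids[:c1])
        splits["dev2"].extend(ids[c1:c2])
        splits["test"].extend(ids[c2:])
    return splits

Sampling at the docket level (rather than the document level) keeps all documents of a docket in the same split, which avoids leakage between development and test sets.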
In turn, determinations of whether sentences, tokenized, and lemmatized using SpaCy [24]7 . a rule was materially revised based on the public comments may also require a clear understanding of what was proposed in the first 3.4 Annotation place. 3.4.1 Rule Documents. We hired ten students from Carnegie Mel- An extreme example of this can be seen from the following lon University and the University of Pittsburgh to perform the comment dismissal sentence: annotation tasks during the period of February 2019–April 2019. “Certain aspects of good engineering judgment described in the All annotators are at least second year undergraduate students. Five exhaust control system, evaporate control system, and fuel delivery of the annotators are masters students in fields including computer control system sections may be approached differently than described science, public health, product management, and international rela- above, but EPA expects that test data demonstrating compliance is tions. The other five are undergraduate students in civil engineering, required rather than optional in such cases.” creative writing, business, and human computer interaction. The sentence responds to technical objections to a regulation 5 For example, Office of Water/Headquarters, Office of Air and Radiation/Region 1 – by conceding that alternatives are valid (“may be approached dif- ferently”) but goes on to state the substantive decision in domain Boston. 6 For example, the first and the second level sections are numbered consecutively in terminology (“compliance is required rather than optional”, suggest- Roman numbers and Latin letters, respectively. ing that the comment had advocated for the “optional” alternative). 7 Version 2.0.18 (model en_core_web_sm) Conference’17, July 2017, Washington, DC, USA Belova, Grabmair, and Nyberg The annotators were trained to perform the two tasks described 4.1 Handcrafted Features in Section 3.1.1 and Section 3.1.2. For the first task, each annotator For sentence representation we concatenate three categories of received an hour-long in-person training as well as individualized handcrafted features. First, we featurized the text of the sentence feedback on a set of four training documents. For the second task, for which the prediction needs to be made, as well as the text of the the guidelines were delivered via a video. Each annotator received preceding sentence, and concatenate the feature vectors. We use 50 documents on average, including reliability annotations. The original tokens (including stop words, but excluding punctuation), documents were allocated such that each annotator worked on a modified tokens with attached POS tags, bigrams of modified tokens, balanced mix of documents from different EPA offices, regions, and and bigrams of POS tags.13 We apply feature hashing [40] to reduce dev1/dev2/test set dockets. The annotations were performed using dimensionality. This results in a feature set of size 2,001. an online tool developed by a collaborating group at the University Second, we featurized the text of the section header containing of Pittsburgh called Gloss. the sentence in question. In that, we apply the same feature genera- Finally, we note that some annotators did not complete all as- tion process used for sentences to the text of the sentence-bearing signments for the segmentation task, leading to some redistribution section header and the header that precedes it. The dimension of of work. 
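To make the preprocessing of Section 3.3 and the feature construction of Section 4.1 concrete, the following sketch combines spaCy sentence splitting, tokenization, lemmatization, and POS tagging with hashed lexical features; the hashing dimension and the joint hashing of current- and preceding-sentence features are simplifications of the setup reported in the paper (which concatenates separately hashed vectors).

import spacy
from sklearn.feature_extraction import FeatureHasher

nlp = spacy.load("en_core_web_sm")  # the paper uses spaCy 2.0.18 with this model

def section_to_sentences(section_text):
    """Split a section into sentences and keep tokens, lemmas, and POS tags."""
    doc = nlp(section_text)
    sentences = []
    for sent in doc.sents:
        toks = [t for t in sent if not t.is_punct]
        sentences.append({
            "text": sent.text,
            "tokens": [t.text.lower() for t in toks],
            "lemmas": [t.lemma_.lower() for t in toks],
            "pos": [t.pos_ for t in toks],
        })
    return sentences

def lexical_features(sent):
    """Tokens, token+POS, token bigrams, and POS bigrams as string features."""
    toks, pos = sent["tokens"], sent["pos"]
    feats = list(toks)
    feats += ["%s_%s" % (t, p) for t, p in zip(toks, pos)]
    feats += ["%s_%s" % (a, b) for a, b in zip(toks, toks[1:])]
    feats += ["POS_%s_%s" % (a, b) for a, b in zip(pos, pos[1:])]
    return feats

hasher = FeatureHasher(n_features=2000, input_type="string")  # dimension is illustrative

def featurize_section(section_text):
    sentences = section_to_sentences(section_text)
    rows = []
    for i, sent in enumerate(sentences):
        feats = lexical_features(sent)
        if i > 0:  # context from the preceding sentence
            feats += ["PREV_" + f for f in lexical_features(sentences[i - 1])]
        rows.append(feats)
    return hasher.transform(rows)  # sparse matrix with one row per sentence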
The comment response classification task was completed this feature set is 101. by eight annotators of the initial ten annotators. Third, we also add a binary flag equal to one if a header of the section in which the sentence occurs has been predicted to con- 3.4.2 Section Headers. Annotation of the section headers was per- tain a comment discussion. We generate these predictions through formed by a sole expert annotator (the first author). To this end, instance-based learning on the unique section headers from the all unique section titles were extracted along with three samples training set of dockets set aside for this purpose (see Section 3.4.2). of the first paragraph following the section title. These examples Based on the unique headers from the associated validation docket are used to judge whether a section contains comment discussion: set, this signal mining procedure has a recall of 0.54 and a precision If all three sample paragraphs include comment discussions, the of 0.88. section title is flagged as the comment-discussion-indicative title.8 4.2 Neural Features 4 METHODS We employ BERT[12] to create embedded vector representations To generate baseline results, we use a classic linear SVM9 and for sentences and section headers. BERT is a state of the art neural linear-chain CRF10 learners to segment the rule documents into network language model trained on a large collection of English text spans that contain public comment discussion and merit evalu- in a quasi-unsupervised fashion by having it learn to predict masked ation by the agency.11 The benefit of the CRF over the SVM is words in a sentence, or to classify whether one sentence follows that, when predicting a sentence label, it takes into account the another, or not. By doing so, BERT learns to maintain a neural label of the prior and subsequent sentence in addition to the focal representation of language context. These vector representations sentence’s feature vector. In addition, to understand the impact of of English text can then be used as for various natural language incorporating feature interactions, we conduct experiments with processing tasks and have been shown to yield significantly better the Multi-Layer-Perceptron (MLP)[23].12 performance than context-independent word embeddings. We estimate three binary sentence-level models predicting whether As in case of the hand-crafted features, we concatenate both the a given sentence contains: (i) a public comment discussion, (ii) a vectors of the sentence/header in question as well as the context dismissal of a public comment by the agency, and (iii) an agency represented by the preceding sentence/header to form a final fea- decision to revise the proposed rule based on the public comments. ture vector. We explore performance of the available pretrained For the CRF modeling, a training instance is a sequence of sen- BERT model as well as a BERT model that has been fine-tuned on tences within the rule document section boundaries. To address the approximately 6,000 rule documents from our corpus that have label sparsity for the comment dismissal/revision classification, we not been included in the annotated document sets. To this end, we explore the utility of training models only on data that is known to rely on a PyTorch[47] implementation of BERT.14 The size of the contain comment discussion (i.e. on the non-ignorable sentences) generated sentence/header embedding is 728. 
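The two baseline learners described in Section 4 can be instantiated roughly as follows, using the hyperparameters stated in footnotes 9 and 10 (linear SVC with C=1 and at most 1,500 iterations; 1-slack structural SVM with a chain CRF, C=0.1, and at most 1,500 iterations). The toy inputs only illustrate that each CRF training instance is one document section, i.e. a sequence of sentence feature vectors and their labels; variable names are assumptions.

import numpy as np
from sklearn.svm import SVC
from pystruct.models import ChainCRF
from pystruct.learners import OneSlackSSVM

# Sentence-level linear SVM (footnote 9 settings).
svm = SVC(kernel="linear", C=1.0, max_iter=1500)
# svm.fit(X_sentences, y_sentences); svm.decision_function(X_test) yields ranking scores.

# Linear-chain CRF over the sentences of one section (footnote 10 settings),
# so the labels of neighboring sentences are modeled jointly.
crf = OneSlackSSVM(model=ChainCRF(), C=0.1, max_iter=1500)

# Toy inputs, only to show the expected shapes:
X_sections = [np.random.rand(5, 20), np.random.rand(3, 20)]
y_sections = [np.array([0, 1, 1, 0, 0]), np.array([0, 0, 1])]
crf.fit(X_sections, y_sections)
predicted = crf.predict([np.random.rand(4, 20)])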
The fine-tuned model and then composing a two tiered model to first detect comment was trained for seven epochs. discussions, and then then classify their polarity. The hyperparam- eters have been tuned by fitting the models to the dev1-set and 5 EVALUATION evaluating results on the dev2-set. We evaluate the quality of the rule document annotation using Co- hen’s kappa coefficient [10], as well as qualitatively. Performance of 8 For example, there were several first level section titles “What comments did EPA our baseline text segmentation models is evaluated on the test set at receive?”. 9 We use scikit-learn version 0.20.2 SVC implementation [48] with an error term penalty the sentence level using area under the ROC curve (AUC), F1-score, parameter of 1, and 1,500 as the maximum number of iterations. precision, and recall. We found a sentence to be the most meaning- 10 We use PyStruct 0.3.2 implementation [42] of margin re-scaled structural SVM ful operational definition of a passage, because comment-discussing using the 1-slack formulation and cutting plane method [28]. We used regularization parameter of 0.1 and 1,500 as the maximum number of iterations. 13 We do not use a TFIDF feature representation because it has not performed as well 11 We have been unable to fit kernelized polynomial and RBF SVMs to our data because as a simple count-based featurizer in our preliminary experiments. these methods do not scale well to the size of our dataset. 14 PyTorch Pretrained BERT: The Big and Extending Repository of pretrained Trans- 12 We use a scikit-learn version 0.20.2 MLP implementation [48] with one hidden layer formers from https://github.com/huggingface/pytorch-pretrained-BERT. We used the of 100 units optimized for at most 100 epochs at the default settings. bert-base-uncased version of the model. Segmentation of Rulemaking Documents Conference’17, July 2017, Washington, DC, USA sentences are often interspersed with ignorable sentences of a sec- Table 1: Characteristics of the Annotated Data tion or a paragraph. For each model, the classification cutoff has been determined using a threshold that maximizes the F1-score on Characteristic Dev1- Dev2- Test- the training data. set set set Number of the Data Set Elements 6 RESULTS Dockets 75 76 73 6.1 Annotation Documents 116 136 99 Table 1 summarizes the key properties of the annotated dataset. Sections 2,197 2,123 1,766 For this summary, we have converted span-level annotations into Sentences 72,969 61,837 61,042 sentence-level annotations. To this end, we have assigned a label to Words 1,820,619 1,583,518 1,430,134 a sentence if an annotator has marked 80% of tokens that make up Number of the Annotated Sentences that sentence. For documents that have been annotated by multiple Non-ignorable content 19,465 20,105 12,979 individuals, we assign a label to a sentence if at least one individual Comment dismissals 3,527 3,225 2,202 has labeled the sentence. This approach has been motivated by a Comment-based regulatory change 2,092 1,015 1,088 qualitative examination of annotations, which revealed low recall is- sues for some annotators. Depending on the dataset, non-ignorable Number of the Double-Annotated Sentences content (i.e. text labeled as discussing comments) comprises 21% Non-ignorable content* 42,296 25,300 41,572 to 33% of all sentences, comment dismissals comprise 4% to 5% Refined content** 33,595 18,561 32,331 of all sentences, and comment-based revisions comprise 2% to 3% Annotator Agreement (Kappa) of all sentences. 
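The evaluation protocol, choosing the classification cutoff that maximizes F1 on the training data and then reporting AUC, F1, precision, and recall at the sentence level, can be sketched as follows; function and variable names are illustrative.

import numpy as np
from sklearn.metrics import precision_recall_curve, precision_recall_fscore_support, roc_auc_score

def threshold_max_f1(y_true, scores):
    """Pick the decision cutoff that maximizes F1 (applied to the training data)."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return thresholds[np.argmax(f1[:-1])]  # the last P/R point has no associated threshold

def evaluate(y_true, scores, cutoff):
    y_pred = (scores >= cutoff).astype(int)
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
    return {"AUC": roc_auc_score(y_true, scores), "F1": f1, "Prec.": p, "Recall": r}

# cutoff = threshold_max_f1(y_train, train_scores)
# print(evaluate(y_test, test_scores, cutoff))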
Approximately half of all labeled sentences have Non-ignorable sentences* 0.42 0.52 0.67 been annotated by two individuals. Due to the annotator attrition, Non-ignorable sentences** 0.38 0.43 0.64 reliability annotations for a more refined labeling task (i.e., identifi- Neutral comment discussion 0.39 0.44 0.66 cation of comment dismissals and comment-based rule revisions) Comment dismissals 0.32 0.18 0.29 are available for 73% to 79% of all double-annotated sentences. Comment-based regulatory change 0.086 0.19 0.16 Table 1 also reports the inter-annotator agreement statistics, Multi-class 0.33 0.38 0.56 while Table 2 summarizes agreement with the expert annotator Notes: * Sentences for which double annotation of non-ignorable on four final rule documents used as part of the annotator train- content is available. ** Sentences for which double annotation of ing. (Expert annotations have been produced by the first author, content is also available. who has 10 years of professional experience in supporting EPA’s regulatory proposal development.) For the non-ignorable content, Table 2: Annotator Agreement* with Expert inter-annotator agreement scores range from 0.38 to 0.67 (depend- ing on the dataset), whereas agreement with the expert is 0.74 on average (range: 0.35–0.95). We note that agreement on this task ap- Kappa Mean Min Max pears to improve from the dev1 set to the test set, which may reflect Non-ignorable content 0.74 0.35 0.95 that the annotators learned to do the task better over time, given the Comment dismissals** 0.33 0 0.54 order in which the documents have been assigned. Inter-annotator Comment-based revisions** 0.38 0 0.75 agreement for the comment dismissal labeling task ranges from Notes: * Agreement is calculated at the sentence level for four final 0.18 to 0.32, while agreement on the comment-based rule revisions rule documents. A total of 4,105 sentences are available for this is very low, ranging between 0.086 and 0.19. Agreement with the evaluation. ** These statistics are calculated for the eight expert on these tasks is also low: 0.33 (range: 0–0.54) for the com- annotators who performed the task. ment dismissals and 0.38 (range: 0–0.75) for the comment-based rule revisions. We have reviewed the annotator errors vis-a-vis the expert an- notator. False negatives tend to occur most commonly when: passage requires complex inference. As such, the annotators tended to be conservative about assigning these labels for • The annotator captures only the initial part of the com- less obvious examples. ment discussion that contains typical lexical cues (e.g., “EPA received comments suggesting...”, “Commenters noted...”, For the false positives, we have observed the following tenden- “EPA agrees with the commenters...”) but fails to include cies: the entire—usually technical—comment discussion that can • EPA regulations are typically incremental, in that they often span multiple subsequent paragraphs; tend to modify older, preexisting rules. Therefore, the final • A passage with comment discussion is “buried” in the middle and proposal rule document discuss changes/ revisions of of a longer paragraph, as often happens when comments are the prior regulatory standard. 
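A sketch of the label aggregation described in Section 6.1 (a sentence receives a label if an annotator marked at least 80% of its tokens, and labels from multiple annotators are combined by union), together with the pairwise Cohen's kappa reported in Table 1; representing annotations as token-offset spans is an assumption of this sketch.

from sklearn.metrics import cohen_kappa_score

def sentence_label(sent_start, sent_end, annotated_spans, min_fraction=0.8):
    """1 if an annotator marked at least `min_fraction` of the sentence's tokens.
    `annotated_spans` is a list of (start, end) token offsets for one annotator."""
    covered = sum(max(0, min(end, sent_end) - max(start, sent_start))
                  for start, end in annotated_spans)
    return int(covered >= min_fraction * (sent_end - sent_start))

def union_label(labels_per_annotator):
    """A sentence is positive if at least one annotator labeled it."""
    return [int(any(column)) for column in zip(*labels_per_annotator)]

# Pairwise agreement on doubly annotated sentences, as reported in Table 1:
# kappa = cohen_kappa_score(labels_annotator_a, labels_annotator_b)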
This has been a significant discussed in the background section; source of confusion for the annotators, who found it diffi- • For the more difficult annotation task of identifying comment- cult to separate comment-based revisions of the proposed based rule revisions and comment dismissals, we have noted regulation from the revisions of the regulatory standard on that false negatives tend to occur when the evaluation of the the regulatory agenda, leading to false positives. Conference’17, July 2017, Washington, DC, USA Belova, Grabmair, and Nyberg • Another challenge for the annotators has been the decision Table 3: Baseline Test Set Results of when the discussion switches from comment-related to the general topics, also leading to false positives. Model AUC F1 Prec. Recall • Specifically for the comment-based rule revisions, some an- All Non-ignorable Content notators found it challenging to distinguish between revi- sions of the proposed rule that were based on comments Random 0.501 0.200 0.164 0.256 from revisions that occurred for other reasons. For example, CRF+HCF n.a 0.717 0.750 0.687 the EPA may implement revisions based on new evidence SVM+HCF 0.911 0.716 0.734 0.698 that emerges after the proposed rule is submitted for public SVM+BERT (as is) 0.921 0.695 0.721 0.672 SVM+HCF+BERT (as is) 0.915 0.689 0.753 0.636 review. SVM+BERT (tuned) 0.928 0.703 0.764 0.651 SVM+HCF+BERT (tuned) 0.913 0.693 0.709 0.677 6.2 Classification Results Comment Dismissals Table 3 and Table 4 show the test set evaluation performance results Random 0.502 0.020 0.017 0.023 for each binary classification task divided by learning framework Semi-Random 0.811 0.152 0.162 0.144 and feature set. The models have produced better than random pre- dictions, with largest AUC of 0.937 noted for the non-ignorable con- CRF+HCF n.a. 0.209 0.225 0.195 SVM+HCF 0.760 0.258 0.177 0.478 tent prediction and smallest AUC of 0.677 noted for the comment- SVM+BERT (as is) 0.869 0.277 0.194 0.484 based rule change prediction. These patterns largely reflect the SVM+HCF+BERT (as is) 0.862 0.258 0.170 0.537 differences in the quality of annotations obtained for our prediction SVM+BERT (tuned) 0.869 0.278 0.196 0.478 tasks, with the segmentation task being significantly easier than SVM+HCF+BERT (tuned) 0.768 0.257 0.191 0.393 the comment response classification task. 2-SVM+HCF 0.872 0.281 0.202 0.460 For the non-ignorable content prediction, the models produce 2-SVM+BERT (as is) 0.874 0.286 0.214 0.432 recall in the range of 0.636–0.708 and precision in the range of 2-SVM+HCF+BERT (as is) 0.881 0.257 0.214 0.322 0.688–0.798. Unsurprisingly, for the more complex annotation tasks 2-SVM+BERT (tuned) 0.888 0.318 0.249 0.441 with low annotator agreement, classification quality is poor. For the 2-SVM+HCF+BERT (tuned) 0.830 0.271 0.212 0.375 comment dismissal prediction, recall is 0.085–0.537 and precision is Comment-based Regulatory Change 0.091–0.249, whereas for the comment-based rule change prediction, Random 0.503 0.038 0.032 0.046 recall is 0.065–0.490 and precision is 0.056–0.189. Semi-Random 0.678 0.050 0.053 0.048 6.2.1 Linear Model Analysis. CRF model results do not appear to CRF+HCF n.a. 0.088 0.091 0.085 be materially different from those generated by the SVM model SVM+HCF 0.677 0.092 0.056 0.273 on the same handcrafted feature set, even through they take into SVM+BERT (as is) 0.802 0.126 0.074 0.420 account the labels of neighboring sentences. 
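The two-tiered models reported in Table 3 (2-SVM rows) can be sketched as follows: a first-tier classifier detects comment-discussion sentences, and a second-tier classifier, trained only on comment-discussion sentences, labels the detected ones as dismissals (or revisions). The SVM settings mirror footnote 9; the composition itself is our illustrative reading of the description in Section 4.

import numpy as np
from sklearn.svm import SVC

def fit_two_tier(X, y_discussion, y_dismissal):
    """Tier 1 detects comment discussion; tier 2 is trained only on discussion sentences."""
    tier1 = SVC(kernel="linear", C=1.0, max_iter=1500).fit(X, y_discussion)
    in_discussion = y_discussion == 1
    tier2 = SVC(kernel="linear", C=1.0, max_iter=1500).fit(X[in_discussion], y_dismissal[in_discussion])
    return tier1, tier2

def predict_two_tier(tier1, tier2, X):
    flagged = tier1.predict(X) == 1
    labels = np.zeros(len(X), dtype=int)
    if flagged.any():
        labels[flagged] = tier2.predict(X[flagged])
    return labels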
We note, however, SVM+HCF+BERT (as is) 0.736 0.099 0.058 0.335 SVM+BERT (tuned) 0.815 0.125 0.077 0.337 that the CRF models have produced consistently higher precision SVM+HCF+BERT (tuned) 0.754 0.091 0.051 0.446 scores, compared to the SVM models estimated on the same feature 2-SVM+HCF 0.724 0.081 0.091 0.073 set. Because we experienced some convergence problems with CRF 2-SVM+BERT (as is) 0.796 0.104 0.112 0.097 models, we have fit them to only one feature set. 2-SVM+HCF+BERT (as is) 0.745 0.078 0.075 0.081 Table 3 also shows that neural BERT features on average tend to 2-SVM+BERT (tuned) 0.808 0.086 0.128 0.065 generate higher AUC, precision, and recall. We note that the two- 2-SVM+HCF+BERT (tuned) 0.744 0.108 0.088 0.138 tiered models perform better for the comment dismissal prediction, Notes: Random – predictions are draws from a Bernoulli distribution with probability but not for the comment-based revision prediction. In the latter case, set to the target class prior. Semi-Random – predictions are generated by first applying the gains in precision are minor and do not offset the significant the best-performing non-ignorable content classifier and then drawing from a losses in recall. Bernoulli distribution with probability set to the target class conditional prior. 2-SVM We also observe that neural features based on the fine-tuned – a two-tiered SVM model. HCF – hand crafted features. AUC – area under the ROC BERT can perform better than those using out-of-the-box BERT curve. CRF model does not produce confidence scores, hence AUC estimation was not (e.g. best AUC and precision on non-ignorable content prediction). possible. The classification cutoff was chosen to maximize F1 score for each model. Interestingly, combining neural and handcrafted feature sets gener- ally does not produce synergy performance increases, which could obtained on for the MLP with an identity transformation (MLP-Id) be due to the substantial increase in the overall feature dimension, before the final softmax.15 or the lack of feature interaction capacity in linear models. We observe that nonlinear models using BERT features can achieve somewhat higher AUC and F1 scores than the linear models 6.2.2 Multi-Layer Perceptron Results. In a second set of experi- shown in Table 3. We also see that adding handcrafted features to ments we assessed whether classification performance increases with models that allow for feature interactions. To this end, we 15 We have also obtained results for the MLP with an a Rectified Linear Unit (ReLU) trained a series of Multi-Layer-Perceptron models (i.e. a neural net- activation function before the final softmax MLP-ReLU. The practical difference is that a ReLU activation will truncate all incoming negative activation values to 0 and leave work with one hidden layer of size 100 and a two-class softmaxed positive ones unchanged. We do not report these results because they were largely output) on our tasks and feature sets. Table 4 contains the results we inferior to those obtained for the MLP-Id variant. Segmentation of Rulemaking Documents Conference’17, July 2017, Washington, DC, USA a model can occasionally yield some performance synergy. 
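The MLP variants evaluated in Table 4 correspond to the scikit-learn configuration described in footnotes 12 and 15, roughly as follows; the fitting and scoring lines are illustrative.

from sklearn.neural_network import MLPClassifier

# MLP-Id: one hidden layer of 100 units with an identity activation before the
# output layer, trained for at most 100 epochs at otherwise default settings.
mlp_id = MLPClassifier(hidden_layer_sizes=(100,), activation="identity", max_iter=100)

# MLP-ReLU (footnote 15) differs only in the hidden-layer activation.
mlp_relu = MLPClassifier(hidden_layer_sizes=(100,), activation="relu", max_iter=100)

# mlp_id.fit(X_train, y_train)
# scores = mlp_id.predict_proba(X_test)[:, 1]  # used for AUC and cutoff selection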
From False Positives: The models tend to produce false positives when this we infer that nonlinear models could potentially produce bet- sentences contain certain trigger words (such as “response”, “re- ter results on our dataset, and hence we plan to experiment with vision”, “finalizing the rule as proposed”) yet the overall context recurrent or dilated convolutional models for sequence tagging to of the passage is not related to the discussion of public comments. leverage the document context in future work. For example, these trigger words have been observed in passages discussing petitions and revisions of the regulatory standard that are not based on comments, similar to mistakes made by human Table 4: Auxiliary Test Set Results annotators. There is also a fair share of label noise: As noted earlier, the annotators have been challenged by longer comment discus- Model AUC F1 Prec. Recall sions and occasionally failed to capture the entire relevant span. All Non-ignorable Content We also conjecture that in this case the models have been guided by the section-header related signal. Random 0.501 0.200 0.164 0.256 MLP-Id+HCF 0.911 0.711 0.776 0.656 False Negatives: The false negatives tend to occur in sections that MLP-Id+BERT (as is) 0.917 0.678 0.688 0.669 do not commonly contain comment discussion (e.g., “Background”, MLP-Id+HCF+BERT (as is) 0.930 0.731 0.798 0.674 “Executive Order Review”). Sentences that lack the boilerplate lan- MLP-Id+BERT (tuned) 0.937 0.705 0.772 0.648 guage (e.g., “response”, “EPA”, “comment”) also tend to be missed MLP-Id+HCF+BERT (tuned) 0.930 0.732 0.759 0.708 more often. As with the false positives, we observed some amount Comment Dismissals of label noise, often in cases when the annotators mislabeled discus- Random 0.502 0.020 0.017 0.023 sions of regulatory revisions that have not been driven by public Semi-Random 0.811 0.152 0.162 0.144 feedback or when annotators have failed to determine an appropri- ate boundaries for the technical discussion of comments. MLP-Id+HCF 0.851 0.289 0.208 0.476 MLP-Id+BERT (as is) 0.875 0.273 0.204 0.410 Label Confusion: We have observed several cases of the models MLP-Id+HCF+BERT (as is) 0.871 0.297 0.213 0.492 being confused about the polarity of EPA assessment, particularly MLP-Id+BERT (tuned) 0.893 0.284 0.209 0.442 when the sentence has included trigger words such as “agree” and MLP-Id+HCF+BERT (tuned) 0.825 0.291 0.212 0.460 “disagree” together. 2-MLP-Id+HCF 0.850 0.284 0.229 0.374 2-MLP-Id+BERT (as is) 0.859 0.281 0.218 0.394 Parsing: We have noted several instances of erroneous sentence 2-MLP-Id+HCF+BERT (as is) 0.887 0.301 0.243 0.395 parsing (e.g., a citation “40 CFR 51.1010(b).” has been isolated as a 2-MLP-Id+BERT (tuned) 0.890 0.309 0.240 0.432 sentence) that lead to classification errors. This issue could be reme- 2-MLP-Id+HCF+BERT (tuned) 0.882 0.294 0.239 0.383 died by a sentence boundary detector oriented towards processing Comment-based Regulatory Change legal text [54]. Random 0.503 0.038 0.032 0.046 Semi-Random 0.678 0.050 0.053 0.048 7 DISCUSSION MLP-Id+HCF 0.718 0.103 0.061 0.329 It is likely possible to automatically identify certain type of con- MLP-Id+BERT (as is) 0.818 0.121 0.072 0.384 tent in regulatory documents with irregular structure. 
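The citation-fragment parsing errors noted above could be mitigated either by a legal-domain sentence boundary detector [54] or by a lightweight post-processing heuristic such as the illustrative sketch below; the regular expression is our assumption and not part of the authors' pipeline.

import re

# Matches fragments that consist only of a CFR/FR-style citation, e.g. "40 CFR 51.1010(b)."
CITATION_ONLY = re.compile(r"^\s*\d+\s+(CFR|FR)\s+[\d.()a-z-]+\.?\s*$", re.IGNORECASE)

def merge_citation_fragments(sentences):
    """Reattach citation-only 'sentences' to the preceding sentence."""
    merged = []
    for sentence in sentences:
        if merged and CITATION_ONLY.match(sentence):
            merged[-1] = merged[-1].rstrip() + " " + sentence.strip()
        else:
            merged.append(sentence)
    return merged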
Our baseline MLP-Id+HCF+BERT (as is) 0.766 0.113 0.068 0.335 segmentation performance for detecting comment discussion sen- MLP-Id+BERT (tuned) 0.837 0.138 0.086 0.343 tences with recall in the range of 0.636–0.708 and precision in the MLP-Id+HCF+BERT (tuned) 0.806 0.114 0.065 0.490 range of 0.688–0.798. While we have focused on identifying com- 2-MLP-Id+HCF 0.757 0.078 0.093 0.068 ment discussion by the receiving agency, we believe that there are 2-MLP-Id+BERT (as is) 0.723 0.092 0.077 0.114 other types of content (e.g., regulatory requirements) automated 2-MLP-Id+HCF+BERT (as is) 0.770 0.092 0.112 0.078 segmentation of which may be both, desired and feasible. Detect- 2-MLP-Id+BERT (tuned) 0.766 0.123 0.189 0.091 ing specific comment discussions that either dismiss comments 2-MLP-Id+HCF+BERT (tuned) 0.789 0.130 0.113 0.154 or announce rule revision turns out to be a harder task for both Notes: Random – predictions are draws from a Bernoulli distribution with probability annotators and, consequently, for models. Moving forward, this set to the target class prior. Semi-Random – predictions are generated by first applying the best-performing non-ignorable content classifier and then drawing from begs the question of which information need the model caters to. If a Bernoulli distribution with probability set to the target class conditional prior. value is added by quickly pointing an expert to comment discussion 2-MLP – a two-tiered MLP model. HCF – hand crafted features. AUC – area under the passages, then a well-performing model is within reach given good ROC curve. MLP-Id – a multi-layer perceptron with one hidden layer with 100 units training data. On the other hand, an automated analysis of topics and an identity non-linearity followed by a Softmax; this model is equivalent to a for which comments have been influential remains a hard problem. generalized linear regression model with interaction terms. The classification cutoff We also note that our dataset has been compiled using highly was chosen to maximize F1 score for each model. educated non-expert annotators. We have found that this type of background is sufficient for producing relatively coarse annota- tions (e.g., identifying parts of the document that contain comment 6.2.3 Error Analysis. For our best-performing models we have discussion). We have measured the annotator-expert agreement generated and examined five random examples for each type of of 0.74 for the comment discussion identification task. However, error. Our findings are as follows: more refined annotation tasks, such as the ones determining the Conference’17, July 2017, Washington, DC, USA Belova, Grabmair, and Nyberg agency’s responses to public feedback, would likely require expert- [15] Catherine Dumas, Teresa M Harrison, Loni Hagen, and Xiaoyi Zhao. 2017. What level understanding of the domain. Do the People Think?: E-Petitioning and Policy Decision Making. In Beyond Bureaucracy. Springer, 187–207. We believe that our baseline modeling results can be further [16] Yixing Fan, Jiafeng Guo, Yanyan Lan, Jun Xu, Chengxiang Zhai, and Xueqi improved by developing a fully neural sequence tagging model, Cheng. 2018. Modeling diverse relevance patterns in ad-hoc retrieval. In The 41st International ACM SIGIR Conference on Research & Development in Information such as the one developed for the standard discourse segmentation Retrieval. ACM, 375–384. corpus [56]. However, even with access to the sequence encoders [17] Vanessa Wei Feng and Graeme Hirst. 2014. 
Two-pass discourse segmentation such as BERT, the limited size of our corpus may still present a with pairing and global features. arXiv preprint arXiv:1407.8215 (2014). [18] Elisa Ferracane, Titan Page, Junyi Jessy Li, and Katrin Erk. 2019. From News to modeling challenge. Medical: Cross-domain Discourse Segmentation. arXiv preprint arXiv:1904.06682 (2019). [19] Loni Hagen, Teresa M Harrison, and Catherine L Dumas. 2018. Data Analytics 8 CONCLUSIONS for Policy Informatics: The Case of E-Petitioning. In Policy Analytics, Modelling, We have produced a dataset and baseline for a novel discourse and Informatics. Springer, 205–224. [20] Loni Hagen, Teresa M Harrison, Özlem Uzuner, Tim Fake, Dan Lamanna, and segmentation task of identifying public comment discussion and Christopher Kotfila. 2015. Introducing textual analysis tools for policy informat- evaluation by regulatory agencies. In doing so we presented ev- ics: a case study of e-petitions. In Proceedings of the 16th annual international idence that detecting comment discussions automatically using conference on digital government research. ACM, 10–19. [21] Loni Hagen, Özlem Uzuner, Christopher Kotfila, Teresa M Harrison, and Dan mainstream NLP techniques is feasible given good training data. Lamanna. 2015. Understanding Citizens’ Direct Policy Suggestions to the Federal Classifying discussions of a particular type is harder both because Government: A Natural Language Processing and Topic Modeling Approach. In System Sciences (HICSS), 2015 48th Hawaii International Conference on. IEEE, of data sparsity and low annotator agreement. While good general 2134–2143. detection performance will add value in some practical settings, [22] Mehedi Hasan, A Kotov, S Naar, GL Alexander, and A Idalski Carcone. 2019. we see opportunity for further improvement in the use of neural Deep neural architectures for discourse segmentation in e-mail based behavioral interventions. In American Medical Informatics Association (AMIA). sequence tagging models, albeit subject to the limitations of data [23] Geoffrey E Hinton. 1990. Connectionist learning procedures. In Machine learning. quality as a function of annotator expertise, training, and type Elsevier, 555–610. system design. [24] Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language under- standing with Bloom embeddings, convolutional neural networks and incremen- tal parsing. To appear (2017). 9 ACKNOWLEDGMENTS [25] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015). The authors thank University of Pittsburgh Intelligent Systems [26] Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level Program student Jaromir Savelka for permission to use the Gloss discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 13–24. annotation tool. [27] Robin Jia, Cliff Wong, and Hoifung Poon. 2019. Document-Level N -ary Re- lation Extraction with Multiscale Representation Learning. arXiv preprint arXiv:1904.02347 (2019). REFERENCES [28] Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu. 2009. Cutting-plane [1] Jaime Arguello and Jamie Callan. 2007. A bootstrapping approach for identifying training of structural SVMs. Machine Learning 77, 1 (2009), 27–59. stakeholders in public-comment corpora. 
In Proceedings of the 8th annual interna- [29] Barbara Konat, John Lawrence, Joonsuk Park, Katarzyna Budzynska, and Chris tional conference on Digital government research: bridging disciplines & domains. Reed. 2016. A Corpus of Argument Networks: Using Graph Properties to Analyse Digital Government Society of North America, 92–101. Divisive Issues.. In LREC. [2] Jaime Arguello, Jamie Callan, and Stuart Shulman. 2008. Recognizing citations in [30] Namhee Kwon, Stuart W Shulman, and Eduard Hovy. 2006. Multidimensional public comments. Journal of Information Technology & Politics 5, 1 (2008), 49–71. text analysis for eRulemaking. In Proceedings of the 2006 international conference [3] Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. 2015. Better document-level on Digital government research. Digital Government Society of North America, sentiment analysis from rst discourse parsing. arXiv preprint arXiv:1509.01599 157–166. (2015). [31] Namhee Kwon, Liang Zhou, Eduard Hovy, and Stuart W Shulman. 2007. Identify- [4] Mohammad Hadi Bokaei, Hossein Sameti, and Yang Liu. 2016. Extractive sum- ing and classifying subjective claims. In Proceedings of the 8th annual international marization of multi-party meetings through discourse segmentation. Natural conference on Digital government research: bridging disciplines & domains. Digital Language Engineering 22, 1 (2016), 41–72. Government Society of North America, 76–81. [5] Claire Cardie, Cynthia R Farina, Matt Rawding, and Adil Aijaz. 2008. An erule- [32] John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional making corpus: Identifying substantive issues in public comments. (2008). random fields: Probabilistic models for segmenting and labeling sequence data. [6] Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2003. Building a (2001). discourse-tagged corpus in the framework of rhetorical structure theory. In [33] Gloria T Lau. 2004. A comparative analysis framework for semi-structured docu- Current and new directions in discourse and dialogue. Springer, 85–112. ments, with applications to government regulations. Stanford University. [7] Nuno Carvalho and Rui Pedro Lourenço. 2018. E-Rulemaking: Lessons from the [34] John Lawrence, Joonsuk Park, Katarzyna Budzynska, Claire Cardie, Barbara Literature. International Journal of Technology and Human Interaction (IJTHI) 14, Konat, and Chris Reed. 2017. Using argumentative structure to interpret debates 2 (2018), 35–53. in online deliberative democracy and eRulemaking. ACM Transactions on Internet [8] Lijun Chen. 2007. Summaritive digest for large document repositories with Technology (TOIT) 17, 3 (2017), 25. application to e-rulemaking. (2007). [35] Karen EC Levy and Michael Franklin. 2014. Driving regulation: using topic [9] Cary Coglianese. 2004. E-Rulemaking: Information technology and the regulatory models to examine political contention in the US trucking industry. Social Science process. Administrative Law Review (2004), 353–402. Computer Review 32, 2 (2014), 182–194. [10] Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational [36] Junyi Jessy Li, Kapil Thadani, and Amanda Stent. 2016. The role of discourse and psychological measurement 20, 1 (1960), 37–46. units in near-extractive summarization. In Proceedings of the 17th Annual Meeting [11] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine of the Special Interest Group on Discourse and Dialogue. 137–147. learning 20, 3 (1995), 273–297. 
[37] Michael A Livermore, Vladimir Eidelman, and Brian Grom. 2017. Computationally [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: assisted regulatory participation. Notre Dame L. Rev. 93 (2017), 977. Pre-training of Deep Bidirectional Transformers for Language Understanding. [38] Daniel Marcu. 2000. The theory and practice of discourse parsing and summariza- CoRR abs/1810.04805 (2018). arXiv:1810.04805 http://arxiv.org/abs/1810.04805 tion. MIT press. [13] Tao Ding and Shimei Pan. 2016. How Reliable Is Sentiment Analysis? A Multi- [39] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. domain Empirical Investigation. In International Conference on Web Information Distributed representations of words and phrases and their compositionality. In Systems and Technologies. Springer, 37–57. Advances in neural information processing systems. 3111–3119. [14] Lauren M Dinour and Antoinette Pole. 2017. Potato Chips, Cookies, and Candy [40] John E. Moody. 1988. Fast Learning in Multi-Resolution Hierarchies. Oh My! Public Commentary on Proposed Rules Regulating Competitive Foods. In Advances in Neural Information Processing Systems 1, [NIPS Confer- Health Education & Behavior 44, 6 (2017), 867–875. ence, Denver, Colorado, USA, 1988]. 29–39. http://papers.nips.cc/paper/ Segmentation of Rulemaking Documents Conference’17, July 2017, Washington, DC, USA 175-fast-learning-in-multi-resolution-hierarchies [41] Peter Muhlberger, Nick Webb, and Jennifer Stromer-Galley. 2008. The Deliberative E-Rulemaking project (DeER): improving federal agency rulemaking via natural language processing and citizen dialogue. In Proceedings of the 2008 international conference on Digital government research. Digital Government Society of North America, 403–404. [42] Andreas C. Müller and Sven Behnke. 2014. pystruct - Learning Structured Prediction in Python. Journal of Machine Learning Research 15 (2014), 2055– 2060. http://jmlr.org/papers/v15/mueller14a.html [43] Joonsuk Park. 2016. Mining and evaluating argumentative structures in user comments in eRulemaking. Cornell University. [44] Joonsuk Park, Cheryl Blake, and Claire Cardie. 2015. Toward machine-assisted participation in eRulemaking: An argumentation model of evaluability. In Pro- ceedings of the 15th International Conference on Artificial Intelligence and Law. ACM, 206–210. [45] Joonsuk Park and Claire Cardie. 2014. Identifying appropriate support for propo- sitions in online user comments. In Proceedings of the First Workshop on Argu- mentation Mining. 29–38. [46] Joonsuk Park, Sally Klingel, Claire Cardie, Mary Newhart, Cynthia Farina, and Joan-Josep Vallbé. 2012. Facilitative moderation for online participation in eRule- making. In Proceedings of the 13th Annual International Conference on Digital Government Research. ACM, 173–182. [47] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W. [48] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cour- napeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830. [49] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. 
Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018). [50] Rachel A Potter. 2017. More than spam? Lobbying the EPA through public comment campaigns. In Brookings Series on Regula- tory Process and Perspective. https://www.brookings.edu/research/ more-than-spam-lobbying-the-epa-through-public-comment-campaigns [51] Stephen Purpura, Claire Cardie, and Jesse Simons. 2008. Active learning for e-rulemaking: Public comment categorization. In Proceedings of the 2008 interna- tional conference on Digital government research. Digital Government Society of North America, 234–243. [52] Reza Rajabiun. 2015. Beyond Transparency: The Semantics of Rulemaking for an Open Internet. Ind. LJ Supp. 91 (2015), 33. [53] Reza Rajabiun and Catherine Middleton. 2015. Public Interest in the Regulation of Competition: Evidence from Wholesale Internet Access Consultations in Canada. Journal of Information Policy 5 (2015), 32–66. [54] Jaromir Savelka, Vern R Walker, Matthias Grabmair, and Kevin D Ashley. 2017. Sentence boundary detection in adjudicatory decisions in the united states. Traite- ment automatique des langues 58, 2 (2017), 21–45. [55] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008. [56] Yizhong Wang, Sujian Li, and Jingfeng Yang. 2018. Toward Fast and Accurate Neural Discourse Segmentation. arXiv preprint arXiv:1808.09147 (2018). [57] Antje Witting. 2015. Measuring the use of knowledge in policy development. Central European Journal of Public Policy 9, 2 (2015), 54–62. [58] Hui Yang and Jamie Callan. 2005. Near-duplicate detection for eRulemaking. In Proceedings of the 2005 national conference on Digital government research. Digital Government Society of North America, 78–86. [59] Hui Yang and Jamie Callan. 2008. Ontology generation for large email collections. In Proceedings of the 2008 international conference on Digital government research. Digital Government Society of North America, 254–261. [60] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceed- ings of the IEEE international conference on computer vision. 19–27.