A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering

Nils Holzenberger
Johns Hopkins University
Baltimore, Maryland, USA
nilsh@jhu.edu

Andrew Blair-Stanek
U. of Maryland Carey School of Law
Johns Hopkins University
Baltimore, Maryland, USA
ablair-stanek@law.umaryland.edu

Benjamin Van Durme
Johns Hopkins University
Baltimore, Maryland, USA
vandurme@cs.jhu.edu

ABSTRACT
Legislation can be viewed as a body of prescriptive rules expressed in natural language. The application of legislation to the facts of a case, where those facts are also expressed in natural language, is what we refer to as statutory reasoning. Computational statutory reasoning is distinct from most existing work in machine reading, in that much of the information needed for deciding a case is declared exactly once (a law), while the information needed in much of machine reading tends to be learned through distributional language statistics. To investigate the performance of natural language understanding approaches on statutory reasoning, we introduce a dataset, together with a legal-domain text corpus. Straightforward application of machine reading models exhibits low out-of-the-box performance on our questions, whether or not they have been fine-tuned to the legal domain. We contrast this with a hand-constructed Prolog-based system, designed to fully solve the task. These experiments support a discussion of the challenges facing statutory reasoning moving forward, which we argue is an interesting real-world task that can motivate the development of models able to utilize prescriptive rules specified in natural language.

CCS CONCEPTS
• Applied computing → Law; • Computing methodologies → Natural language processing; Knowledge representation and reasoning.

KEYWORDS
Law, NLP, Reasoning, Prolog

ACM Reference Format:
Nils Holzenberger, Andrew Blair-Stanek, and Benjamin Van Durme. 2020. A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering. In Proceedings of the 2020 Natural Legal Language Processing (NLLP) Workshop, 24 August 2020, San Diego, US. ACM, New York, NY, USA, 8 pages.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). NLLP @ KDD 2020, August 24th, San Diego, US.

1 INTRODUCTION
Early artificial intelligence research focused on highly-performant, narrow-domain reasoning models, for instance in health [37, 40, 54] and law [30, 38]. Such expert systems relied on hand-crafted inference rules and domain knowledge, expressed and stored with the formalisms provided by databases [21]. The main bottleneck of this approach is that experts are slow in building such knowledge bases and exhibit imperfect recall, which motivated research into models for automatic information extraction (e.g. Lafferty et al. [36]). Systems for large-scale automatic knowledge base construction have improved (e.g. Etzioni et al. [20], Mitchell et al. [41]), as have systems for sentence-level semantic parsing [64]. Among others, this effort has led to question-answering systems for games [22] and, more recently, for science exams [14, 23, 27]. The challenges include extracting ungrounded knowledge from semi-structured sources, e.g. textbooks, and connecting high-performance symbolic solvers with large-scale language models.

In parallel, models have begun to consider task definitions like Machine Reading (MR) [46] and Recognizing Textual Entailment (RTE) [15, 16] as not requiring the use of explicit structure. Instead, the problem is cast as one of mapping inputs to high-dimensional, dense representations that implicitly encode meaning [18, 45] and are employed in building classifiers or text decoders, bypassing classic approaches to symbolic inference.

This work is concerned with the problem of statutory reasoning [62, 66]: how to reason about an example situation, a case, based on complex rules provided in natural language. In addition to the reasoning aspect, we are motivated by the lack of contemporary systems to suggest legal opinions: while there exist tools to aid lawyers in retrieving relevant documents for a given case, we are unaware of any strong capabilities in automatic statutory reasoning.

Our contributions, summarized in Figure 2, include a novel dataset based on US tax law, together with test cases (Section 2). Decades-old work in expert systems could solve problems of the sort we construct here, based on manually derived rules: we replicate that approach in a Prolog-based system that achieves 100% accuracy on our examples (Section 3). Our results demonstrate that straightforward application of contemporary Machine Reading models is not sufficient for our challenge examples (Section 5), whether or not they were adapted to the legal domain (Section 4). This is meant to provoke the question of whether we should be concerned with: (a) improving methods in semantic parsing in order to replace manual transduction into symbolic form; or (b) improving machine reading methods in order to avoid explicit symbolic solvers. We view this work as part of the conversation including recent work in multi-hop inference [61], where our task is more domain-specific but potentially more challenging.


Figure 1: Sample cases from our dataset. The questions can be answered by applying the rules contained in the statutes to the context.

Figure 2: Resources. Corpora on the left hand side were used to build the datasets and models on the right hand side.

2 DATASET
Here, we describe our main contribution, the StAtutory Reasoning Assessment dataset (SARA): a set of rules extracted from the statutes of the US Internal Revenue Code (IRC), together with a set of natural language questions which may only be answered correctly by referring to the rules¹.

The IRC² contains rules and definitions for the imposition and calculation of taxes. It is subdivided into sections which, in general, define one or more terms: section 3306 defines the terms employment, employer and wages, for purposes of the federal unemployment tax. Sections are typically structured around a general rule, followed by a number of exceptions. Each section and its subsections may be cast as a predicate whose truth value can be checked against a state of the world. For instance, subsection 7703(a)(2):

    an individual legally separated from his spouse under a decree of divorce or of separate maintenance shall not be considered as married

can be checked given an individual.

Slots are another major feature of the law. Each subsection refers to a certain number of slots, which may be filled by existing entities (in the above, individual, spouse, and decree of divorce or of separate maintenance). Certain slots are implicitly filled: §7703(a)(1) and (b)(3) mention a "spouse", which must exist since the "individual" is married. Similarly, slots which have been filled earlier in the section may be referred to later on. For instance, "household" is mentioned for the first time in §7703(b)(1), then again in §7703(b)(2) and in §7703(b)(3). Correctly resolving slots is a key point in successfully applying the law.

Overall, the IRC can be framed as a set of predicates formulated in human language. The language used to express the law has an open texture [29], which makes it particularly challenging for a computer-based system to determine whether a subsection applies, and to identify and fill the slots mentioned. This makes the IRC an excellent corpus for building systems that reason with rules specified in natural language and have good language understanding capabilities.

2.1 Statutes and test cases
As the basis of our set of rules, we selected sections of the IRC well-supported by Treasury Regulations, covering tax on individuals (§1), marriage and other legal statuses (§2, 7703), dependents (§152), tax exemptions and deductions (§63, 68, 151) and employment (§3301, 3306). We simplified the sections to (1) remove highly specific sections (e.g. those concerning the employment of sailors) in order to keep the statutes to a manageable size, and (2) ensure that the sections only refer to sections from the selected subset. For ease of comparison with the original statutes, we kept the original numbering and lettering, with no adjustment for removed sections. For example, there is a section 63(d) and a section 63(f), but no section 63(e). We assumed that any taxable year starts and ends at the same time as the corresponding calendar year.

For each subsection extracted from the statutes, we manually created two paragraphs in natural language describing a case, one where the statute applies, and one where it does not. These snippets, formulated as a logical entailment task, are meant to test a system's understanding of the statutes, as illustrated in Figure 1. The cases were vetted by a law professor for coherence and plausibility. For the purposes of machine learning, the cases were split into 176 train and 100 test samples, such that (1) each pair of positive and negative cases belongs to the same split, and (2) each section is split between train and test in the same proportions as the overall split.

Since tax legislation makes it possible to predict how much tax a person owes, we created an additional set of 100 cases where the task is to predict how much tax someone owes. Those cases were created by randomly mixing and matching pairs of cases from the first set of cases, and resolving inconsistencies manually. These cases are no longer a binary prediction task, but a task of predicting an integer. The prediction requires taking into account the entirety of the statutes, and involves basic arithmetic. The 100 cases were randomly split into 80 training and 20 test samples.

Because the statutes were simplified, the answers to the cases are not those that would be obtained with the current version of the IRC. Some of the IRC counterparts of the statutes in our dataset have been repealed, amended, or adjusted to reflect inflation.

¹ The dataset can be found under https://nlp.jhu.edu/law/
² https://uscode.house.gov/browse/prelim@title26&edition=prelim


Table 1: Number of subsections containing cross-references

    Cross-references         Explicit    Implicit
    Within the section           30          25
    To another section           34          44

Table 2: Statistics about the tree structure of the statutes

                       min    max    avg ± stddev    median
    Depth of leaf        1      6       3.6 ± 0.8         4
    Depth of node        0      6       3.2 ± 1.0         3

Table 3: Language statistics. The word "combined" means merging the corpora mentioned above it.

    Vocabulary size:    train 867    test 535    statutes 768    combined 1596

                                       min    max     avg    stddev    median
    Sentence length       train          4    138    12.3       9.1        11
    (in words)            test           4     34    11.6       4.5        10
                          statutes       1     88    16.5      14.9      12.5
                          combined       1    138    12.7       9.5        11
    Case length           train          1      9     4.2       1.7         4
    (in sentences)        test           2      7     3.8       1.3         4
                          combined       1      9     4.1       1.6         4
    Case length           train         17    179    48.5      22.2        43
    (in words)            test          17     81    41.6      14.7        38
                          combined      17    179    46.3      20.3        41
    Section length        sentences      2     16     8.3       4.7         9
                          words         62   1151   488.9     310.4       549

Table 4: Answers to numerical questions (in $)

                  min          max        average        stddev       median
    train           0    2,242,833      85,804.86    258,179.30    15,506.50
    test            0      243,097      65,246.50     78,123.13    26,874.00
    combined        0    2,242,833      81,693.19    233,695.33    17,400.50

2.2 Key features of the corpus
While the corpus is based on a simplification of the Internal Revenue Code, care was taken to retain prominent features of US law. We note that the present task is only one aspect of legal reasoning, which in general involves many more modes of reasoning, in particular interpreting regulations and prior judicial decisions. The following features are quantified in Tables 1 to 4.

Reasoning with time. The timing of events (marriage, retirement, income...) is highly relevant to determining whether certain sections apply, as tax is paid yearly. In total, 62 sections refer to time. Some sections require counting days, as in §7703(b)(1):

    a household which constitutes for more than one-half of the taxable year the principal place of abode of a child

or taking into account the absolute point in time, as in §63(c)(7):

    In the case of a taxable year beginning after December 31, 2017, and before January 1, 2026-

Exceptions and substitutions. Typically, each section of the IRC starts by defining a general case and then enumerates a number of exceptions to the rule. Additionally, some rules involve applying a rule after substituting terms. A total of 50 sections formulate an exception or a substitution. As an example, §63(f)(3):

    In the case of an individual who is not married and is not a surviving spouse, paragraphs (1) and (2) shall be applied by substituting "$750" for "$600".

Numerical reasoning. Computing tax owed requires knowledge of the basic arithmetic operations of adding, subtracting, multiplying, dividing, rounding and comparing numbers. 55 sections involve numerical reasoning. The operation to be used needs to be parsed out of natural text, as in §1(c)(2):

    $3,315, plus 28% of the excess over $22,100 if the taxable income is over $22,100 but not over $53,500
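As a concrete illustration (using a hypothetical taxable income of $30,000, not a value from the dataset), applying §1(c)(2) requires extracting both the constants and the operations from the text:

    tax = $3,315 + 0.28 × ($30,000 − $22,100) = $3,315 + $2,212 = $5,527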
Cross-references. Each section of the IRC will typically reference other sections. Table 1 shows how this feature was preserved in our dataset. There are explicit references within the same section, as in §7703(b)(1):

    an individual who is married (within the meaning of subsection (a)) and who files a separate return

explicit references to another section, as in §3301:

    There is hereby imposed on every employer (as defined in section 3306(a)) for each calendar year an excise tax

and implicit references, as in §151(a), where "taxable income" is defined in §63:

    the exemptions provided by this section shall be allowed as deductions in computing taxable income.

Common sense knowledge. Four concepts, other than time, are left undefined in our statutes: (1) kinship; (2) the fact that a marriage ends if either spouse dies; (3) the facts that if an event has not ended, it is ongoing, that if an event has no start, it has been true at any time before it ends, and that some events are instantaneous (e.g. payments); and (4) the fact that a person's gross income is the sum of all income and payments received by that person.

Hierarchical structure. Law statutes are divided into sections, themselves divided into subsections, with highly variable depth and structure. This can be represented by a tree, with a special ROOT node of depth 0 connecting all the sections. This tree contains 132 leaves and 193 nodes (nodes include leaves). Statistics about depth are given in Table 2.

3 PROLOG SOLVER
It has been shown that subsets of statutes can be expressed in first-order logic, as described in Section 6. As a reaffirmation of this, and as a topline for our task, we have manually translated the statutes into Prolog rules and the cases into Prolog facts, such that each case can be answered correctly by a single Prolog query³. The Prolog rules were developed based on the statutes, meaning that the Prolog code clearly reflects the semantics of the textual form, as in Gunning et al. [27]. This is primarily meant as a proof that a carefully crafted reasoning engine, with perfect natural language understanding, can solve this dataset. There certainly are other ways of representing this given set of statutes and cases. The point of this dataset is not to design a better Prolog system, but to help the development of language understanding models capable of reasoning.

³ The Prolog program can be found under https://nlp.jhu.edu/law/

3.1 Statutes
Each subsection of the statutes was translated into a single rule, true if the section applies, false otherwise. In addition, subsections define slots that may be filled and reused in other subsections, as described in Section 2. To solve this coreference problem, any term appearing in a subsection and relevant across subsections is turned into an argument of the Prolog rule. The corresponding variable may then be bound during the execution of a rule, and reused in a rule executed later. Unfilled slots correspond to unbound variables.
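To make the slot-as-argument encoding concrete, the sketch below shows one way a rule for §7703(a)(2) could be written. The fact predicates marriage_, agent_ and start_ are those of Figure 3; the rule head, the argument order and the remaining predicate names (legal_separation_, in_or_before_year_) are illustrative assumptions rather than the released solver's exact code.

    % Sketch only; the released solver may name and structure this differently.
    % §7703(a)(2): an individual legally separated from his spouse under a decree
    % of divorce or of separate maintenance shall not be considered as married.
    s7703_a_2(Individual, Year) :-
        marriage_(Marriage),                % a marriage event exists...
        agent_(Marriage, Individual),       % ...with this individual as a party,
        agent_(Marriage, Spouse),
        Individual \= Spouse,               % implicitly filling the "spouse" slot
        legal_separation_(Separation),      % assumed predicate for the decree
        agent_(Separation, Individual),
        agent_(Separation, Spouse),
        start_(Separation, Date),
        in_or_before_year_(Date, Year).     % assumed timing predicate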


To check whether a given subsection applies, the Prolog system needs to rely on certain predicates, which directly reflect the facts contained in the natural language descriptions of the cases. For instance, how do we translate Alice and Bob got married on January 24th, 1993 into code usable by Prolog? We rely on a set of 61 predicates, following neo-davidsonian semantics [9, 17, 42]. The level of detail of these predicates is based on the granularity of the statutes themselves. Anything the statutes do not define, and which is typically expressed with a single word, is potentially such a predicate: marriage, residing somewhere, someone paying someone else, etc. The example above is translated in Figure 3.

    marriage_(alice_and_bob).
    agent_(alice_and_bob, alice).
    agent_(alice_and_bob, bob).
    start_(alice_and_bob, "1993-01-24").

Figure 3: Example predicates used.

3.2 Cases
The natural language description of each case was manually translated into the facts mentioned above. The question or logical entailment prompt was translated into a Prolog query. For instance, Section 7703(b)(3) applies to Alice maintaining her home for the year 2018. translates to s7703_b_3(alice,home,2018). and How much tax does Alice have to pay in 2017? translates to tax(alice,2017,Amount).
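Putting Figure 3 together with the queries above, a complete case is posed to the solver roughly as follows. This is a sketch for illustration; the rule names follow Section 3.2, but the released program may differ in detail.

    % Facts derived from the case description (as in Figure 3):
    marriage_(alice_and_bob).
    agent_(alice_and_bob, alice).
    agent_(alice_and_bob, bob).
    start_(alice_and_bob, "1993-01-24").

    % Entailment prompt, true exactly if the subsection applies to this case:
    ?- s7703_b_3(alice, home, 2018).

    % Numerical question; Amount is bound to the tax owed:
    ?- tax(alice, 2017, Amount).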
In the broader context of computational statutory reasoning, the Prolog solver has three limitations. First, producing it requires domain experts, while automatic generation is an open question. Second, translating natural language into facts requires semantic parsing capabilities. Third, small mistakes can lead to catastrophic failure. An orthogonal approach is to replace logical operators and explicit structure with high-dimensional, dense representations and real-valued functions, both learned using distributional statistics. Such a machine learning-based approach can be adapted to new legislation and new domains automatically.

4 LEGAL NLP
As is commonly done in MR, we pretrained our models using two unsupervised learning paradigms on a large corpus of legal text.

4.1 Text corpus
We curated a corpus consisting solely of freely-available tax law documents, with 147M tokens. The first half is drawn from case.law [1], a project of Harvard's Law Library that scanned and OCR'ed many of the library's case-law reporters, making the text available upon request to researchers. The main challenge in using this resource is that it contains 1.7M U.S. federal cases, only a small percentage of which are on tax law (as opposed to criminal law, breach of contract, bankruptcy, etc.). Classifying cases by area is a non-trivial problem [55], and tax-law cases are litigated in many different courts. We used the heuristic of classifying a case as tax-law if it met one of the following criteria: the Commissioner of Internal Revenue was a party; the case was decided by the U.S. Tax Court; or the case was decided by any other federal court, other than a trade tribunal, with the United States as a party, and with the word tax appearing in the first 400 words of the case's written opinion.

The second half of this corpus consists of IRS private letter rulings and unpublished U.S. Tax Court cases. IRS private letter rulings are similar to cases, in that they apply tax law to one taxpayer's facts; they differ from cases in that they are written by IRS attorneys (not judges), have less precedential authority than cases, and redact names to protect taxpayer privacy. Unpublished U.S. Tax Court cases are viewed by the judges writing them as less important than those worthy of publication. These were downloaded as PDFs from the IRS and Tax Court websites, OCR'ed with tesseract if needed, and otherwise cleaned.

4.2 Tax vectors
Before training a word2vec model [39] on this corpus, we did two tax-specific preprocessing steps to ensure that semantic units remained together. First, we put underscores between multi-token collocations that are tax terms of art, defined in either the tax code, Treasury regulations, or a leading tax-law dictionary. Thus, "surviving spouse" became the single token "surviving_spouse". Second, we turned all tax code sections and Treasury regulations into a single token, stripped of references to subsections, subparagraphs, and subclauses. Thus, "Treas. Reg. §1.162-21(b)(1)(iv)" became the single token "sec_1_162_21". The vectors were trained at 500 dimensions using skip-gram with negative sampling. A window size of 15 was found to maximize performance on twelve human-constructed analogy tasks.

4.3 Legal BERT
We performed further training of BERT [18] on a portion of the full case.law corpus, including both state and federal cases. We did not limit the training to tax cases. Rather, the only cases excluded were those under 400 characters (which tend to be summary orders with little semantic content) and those before 1970 (when judicial writing styles had become recognizably modern). We randomly selected a subset of the remaining cases, and broke all selected cases into chunks of exactly 510 tokens, which is the most BERT's architecture can handle. Any remaining tokens in a selected case were discarded. Using solely the masked language model task (i.e. not next sentence prediction), starting from Bert-Base-Cased, we trained on 900M tokens.

The resulting Legal BERT has the exact same architecture as Bert-Base-Cased but parameters better attuned to legal tasks. We applied both models to the natural language questions and answers in the corpus we introduce in this paper. While Bert-Base-Cased had a perplexity of 14.4, Legal BERT had a perplexity of just 2.7, suggesting that the further training on 900M tokens made the model much better adapted to legal queries.


We also probed how this further training impacted the ability to handle fine-tuning on downstream tasks. The downstream task we chose was identifying legal terms in case texts. For this task, we defined legal terms as any tokens or multi-token collocations that are defined in Black's Law Dictionary [25], the premier legal dictionary. We split the legal terms into training/dev/test splits. We put a 4-layer fully-connected MLP on top of both Bert-Base-Cased and Legal BERT, where the training objective was B-I-O tagging of tokens in 510-token sequences. We trained both on a set of 200M tokens randomly selected from case.law cases not previously seen by the model and not containing any of the legal terms in dev or test, with the training legal terms tagged using string comparisons. We then tested both fine-tuned models' ability to identify legal terms from the test split in case law. The model based on Bert-Base-Cased achieved F1 = 0.35, whereas Legal BERT achieved F1 = 0.44. As a baseline, two trained lawyers given the same task on three 510-token sequences each achieved F1 = 0.26. These results indicate that Legal BERT is much better adapted to the legal domain than Bert-Base-Cased. Black's Law Dictionary has well-developed standards for what terms are or are not included. BERT models learn those standards via the train set, whereas lawyers are not necessarily familiar with them. In addition, pre-processing dropped some legal terms that were subsets of too many others, which the lawyers tended to identify. This explains how BERT-based models could outperform trained humans.

5 EXPERIMENTS

5.1 BERT-based models
In the following, we frame our task as textual entailment and numerical regression. A given entailment prompt 𝑞 mentions the relevant subsection (as in Figure 1)⁴. We extract 𝑠, the text of the relevant subsection, from the statutes. In 𝑞, we replace Section XYZ applies with This applies. We feed the string "[CLS] + 𝑠 + [SEP] + 𝑞 + 𝑐 + [SEP]", where "+" is string concatenation and 𝑐 is the case context, to BERT [18]. Let 𝑟 be the vector representation of the token [CLS] in the final layer. The answer (entailment or contradiction) is predicted as 𝑔(𝜃_1 · 𝑟), where 𝜃_1 is a learnable parameter and 𝑔 is the sigmoid function. For numerical questions, all statutes have to be taken into account, which would exceed BERT's length limit. We encode "[CLS] all [SEP] + 𝑞 + 𝑐 + [SEP]" into 𝑟 and predict the answer as 𝜇 + 𝜎𝜃_2 · 𝑟, where 𝜃_2 is a learned parameter, and 𝜇 and 𝜎 are the mean and standard deviation of the numerical answers on the training set.

For entailment, we use a cross-entropy loss, and evaluate the models using accuracy. We frame the numerical questions as a taxpayer having to compute tax owed. By analogy with the concept of "substantial understatement of income tax" from §6662(d), we define Δ(𝑦, ŷ) = |𝑦 − ŷ| / max(0.1𝑦, 5000), where 𝑦 is the true amount of tax owed and ŷ is the taxpayer's prediction. The case Δ(𝑦, ŷ) ≥ 1 corresponds to a substantial over- or understatement of tax. We compute the fraction of predictions ŷ such that Δ(𝑦, ŷ) < 1 and report that as numerical accuracy.⁵
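To make the metric concrete (with hypothetical amounts, not cases from the dataset): a prediction ŷ counts as accurate whenever |𝑦 − ŷ| < max(0.1𝑦, 5000). For a true tax of 𝑦 = $20,000, max(0.1𝑦, 5000) = $5,000, so any prediction within $5,000 of the true amount is counted as correct; for 𝑦 = $100,000, the tolerance widens to $10,000.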
The loss function used is

    L = − Σ_{𝑖∈𝐼_1} [𝑦_𝑖 log ŷ_𝑖 + (1 − 𝑦_𝑖) log(1 − ŷ_𝑖)] + Σ_{𝑖∈𝐼_2} max(Δ(𝑦_𝑖, ŷ_𝑖) − 1, 0)

where 𝐼_1 (resp. 𝐼_2) is the set of entailment (resp. numerical) questions, 𝑦_𝑖 is the ground truth output, and ŷ_𝑖 is the model's output.

We use Adam [34] with a linear warmup schedule for the learning rate. We freeze BERT's parameters, and experiment with unfreezing BERT's top layer. We select the final model based on early stopping, with a random 10% of the training examples reserved as a dev set. The best performing models for entailment and for numerical questions are selected separately, during a hyperparameter search around the recommended setting (batch size = 32, learning rate = 1e-5). To check for bias in our dataset, we drop either the statute, or the context and the statute, in which case we predict the answer from BERT's representation for "[CLS] + 𝑐 + [SEP] + 𝑞 + [SEP]" or "[CLS] + 𝑞 + [SEP]", whichever is relevant.

⁴ The code for these experiments can be found under https://github.com/SgfdDttt/sara
⁵ For a company, a goal would be to have 100% accuracy (resulting in no tax penalties) while paying the lowest amount of taxes possible (giving them something of an interest-free loan, even if the IRS eventually collects the understated tax).

5.2 Feedforward models
We follow Arora et al. [2] to embed strings into vectors, with the smoothing parameter equal to 10⁻³. We use either the tax vectors described in Section 4 or word2vec vectors [39]. We estimate unigram counts from the corpus used to build the tax vectors, or from the training set, whichever is relevant. For a given context 𝑐 and question or prompt 𝑞, we retrieve the relevant subsection 𝑠 as above. Using Arora et al. [2], 𝑠 is mapped to vector 𝑣_𝑠, and (𝑐, 𝑞) to 𝑣_{𝑐+𝑞}. Let 𝑟 = [𝑣_𝑠, 𝑣_{𝑐+𝑞}, |𝑣_𝑠 − 𝑣_{𝑐+𝑞}|, 𝑣_𝑠 ⊙ 𝑣_{𝑐+𝑞}], where [𝑎, 𝑏] is the concatenation of 𝑎 and 𝑏, |·| is the element-wise absolute value, and ⊙ is the element-wise product. The answer is predicted as 𝑔(𝜃_1 · 𝑓(𝑟)) or 𝜇 + 𝜎𝜃_2 · 𝑓(𝑟), as above, where 𝑓 is a feed-forward neural network. We use batch normalization between each layer of the neural network [31]. As above, we perform ablation experiments where we drop the statute, or the context and the statute, in which case 𝑟 is replaced by 𝑣_{𝑐+𝑞} or 𝑣_𝑞. We also experiment with 𝑓 being the identity function (no neural network). Training is otherwise done as above, but without the warmup schedule.

Table 5: Test set scores. We report the 90% confidence interval. All confidence intervals for entailment round to 8.3%.

    Model          Features       Inputs      Entailment    Numerical
    Baseline       -              -                   50     20 ± 15.8
    BERT-based     BERT           question            48     20 ± 15.8
                                  context             55     15 ± 14.1
                                  statutes            49       5 ± 8.6
                   + unfreeze     statutes            53     20 ± 15.8
                   Legal BERT     question            48       5 ± 8.6
                                  context             49       5 ± 8.6
                                  statutes            48       5 ± 8.6
                   + unfreeze     statutes            49     15 ± 14.1
    feedforward    tax vectors    question            54     20 ± 15.8
    neural                        context             49     20 ± 15.8
                                  statutes            50     20 ± 15.8
                   word2vec       question            50     20 ± 15.8
                                  context             50     20 ± 15.8
                                  statutes            50     20 ± 15.8
    feedforward    tax vectors    question            51     20 ± 15.8
    non-neural                    context             51     20 ± 15.8
                                  statutes            47     25 ± 17.1
                   word2vec       question            52     20 ± 15.8
                                  context             51     20 ± 15.8
                                  statutes            53     20 ± 15.8

5.3 Results
We report the accuracy on the test set (in %) in Table 5. In our ablation experiments, "question" models have access to the question only, "context" to the context and question, and "statutes" to the statutes, context and question. For entailment, we use a majority baseline. For the numerical questions, we find the constant that minimizes the hinge loss on the training set up to 2 digits: $11,023.


As a check, we swapped in the concatenation of the RTE datasets of Bentivogli et al. [5], Dagan et al. [16], Giampiccolo et al. [26] and Haim et al. [28], and achieved 73.6% accuracy on the dev set with BERT, close to the numbers reported in Wang et al. [59]. BERT was trained on Wikipedia, which contains snippets of law text: see the article United States Code and links therefrom, especially Internal Revenue Code. Overall, models perform comparably to the baseline, independent of the underlying method. Performance remains mostly unchanged when dropping the statutes, or the statutes and context, meaning that models are not utilizing the statutes. Adapting BERT or word vectors to the legal domain has no noticeable effect. Our results suggest that performance will not be improved through straightforward application of a large-scale language model, unlike on other datasets: Raffel et al. [45] achieved 94.8% accuracy on COPA [49] using a large-scale multitask Transformer model, and BERT provided a huge jump in performance on both the SQuAD 2.0 [46] (+8.2 F1) and SWAG [63] (+27.1 percentage points accuracy) datasets as compared to predecessor models pre-trained on smaller datasets.

Here, we focus on the creation of resources adapted to the legal domain, and on testing off-the-shelf and historical solutions. Future work will consider specialized reasoning models.

6 RELATED WORK
There have been several efforts to translate law statutes into expert systems. Oracle Policy Automation has been used to formalize rules in a variety of contexts. TAXMAN [38] focuses on corporate reorganization law, and is able to classify a case into three different legal types of reorganization, following a theorem-proving approach. Sergot et al. [52] translate the major part of the British Nationality Act 1981 into around 150 rules in micro-Prolog, proving the suitability of Prolog logic to express and apply legislation. Bench-Capon et al. [4] further discuss knowledge representation issues. Closest to our work is Sherman [53], who manually translated part of Canada's Income Tax Act into a Prolog program. To our knowledge, the projects cited did not include a dataset or task that the programs were applied to. Other works have similarly described the formalization of law statutes into rule-based systems [24, 30, 32, 51].

Yoshioka et al. [62] introduce a dataset of Japanese statute law and its English translation, together with questions collected from the Japanese bar exam. To tackle these two tasks, Kim et al. [33] investigate heuristic-based and machine learning-based methods. A similar dataset based on the Chinese bar exam was released by Zhong et al. [66]. Many papers explore case-based reasoning for law, with expert systems [43, 56], human annotations [8] or automatic annotations [3], as well as transformer-based methods [44]. Some datasets are concerned with very specific tasks, such as tagging in contracts [10], classifying clauses [11], and classification of documents [12] or single paragraphs [6]. Ravichander et al. [47] have released a dataset of questions about privacy policies, elicited from turkers and answered by legal experts. Saeidi et al. [50] frame the task of statutory reasoning as a dialog between a user and a dialog agent. A single rule, with or without context, and a series of followup questions are needed to answer the original question. Contrary to our dataset, rules are isolated from the rest of the body of rules, and followup questions are part of the task.

Clark et al. [14] describe a decades-long effort to answer science exam questions stated in natural language, based on descriptive knowledge stated in natural language. Their system relies on a variety of NLP and specialized reasoning techniques, with their most significant gains recently achieved via contextual language modeling. This line of work is the most related in spirit to where we believe research in statutory reasoning should focus. An interesting contrast is that while scientific reasoning is based on understanding the physical world, which in theory can be informed by all manner of evidence beyond texts, legal reasoning is governed by human-made rules. The latter are true by virtue of being written down and agreed to, and are not discovered through evidence and a scientific process. Thus, statutory reasoning is an exceptionally pure instance of a reasoner needing to understand prescriptive language.

Weston et al. [60] introduced a set of prerequisite toy tasks for AI systems, which require some amount of reasoning and common sense knowledge. Contrary to the present work, the types of question in the train and test sets are highly related, and the vocabulary overlap is quite high. Numeric reasoning appears in a variety of MR challenges, such as DROP [19].

Understanding procedural language – knowledge needed to perform a task – is related to the problem of understanding statutes, and so we provide a brief description of some example investigations in that area. Zhang et al. [65] published a dataset of how-to instructions, with human annotations defining key attributes (actee, purpose...) and models to automatically extract the attributes. Similarly, Chowdhury et al. [13] describe a dataset of human-elicited procedural knowledge, and Wambsganß and Fromm [58] automatically detect repair instructions from posts on an automotive forum. Branavan et al. [7] employed text from an instruction manual to improve the performance of a game-playing agent.

7 CONCLUSION
We introduce a resource of law statutes, a dataset of hand-curated rules and cases in natural language, and a symbolic solver able to represent these rules and solve the challenge task. Our hand-built solver contrasts with our baselines based on current NLP approaches, even when we adapt them to the legal domain.

The intersection between NLP and the legal domain is a growing area of research [3, 11, 33, 35, 48], but with few large-scale systematic resources. Thus, in addition to the exciting challenge posed by statutory reasoning, we also intend this paper to be a contribution to legal-domain natural language processing.

Given the poor out-of-the-box performance of otherwise very powerful models, this dataset, which is quite small compared to typical MR resources, raises the question of what the most promising direction of research would be. An important feature of statutory reasoning is the relative difficulty and expense of generating carefully constructed training data: legal texts are written for and by lawyers, who are cost-prohibitive to employ in bulk. This is unlike most instances of MR, where everyday texts can be annotated through crowdsourcing services. There are at least three strategies open to the community: automatic extraction of knowledge graphs from text with the same accuracy as we achieved for our Prolog solver [57]; improvements in MR to be significantly more data efficient in training; or new mechanisms for the efficient creation of training data based on pre-existing legal cases.

Going forward, we hope our resource provides both (1) a benchmark for a challenging aspect of natural legal language processing as well as for machine reasoning, and (2) legal-domain NLP models useful for the research community.


REFERENCES
[1] 2019. Caselaw Access Project. http://case.law
[2] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2016. A simple but tough-to-beat baseline for sentence embeddings. (2016).
[3] Kevin D Ashley and Stefanie Brüninghaus. 2009. Automatically classifying case texts and predicting outcomes. Artificial Intelligence and Law 17, 2 (2009), 125–165.
[4] Trevor JM Bench-Capon, Gwen O Robinson, Tom W Routen, and Marek J Sergot. 1987. Logic programming for large scale applications in law: A formalisation of supplementary benefit legislation. In Proceedings of the 1st international conference on Artificial intelligence and law. 190–198.
[5] Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The Fifth PASCAL Recognizing Textual Entailment Challenge. In TAC.
[6] Carlo Biagioli, Enrico Francesconi, Andrea Passerini, Simonetta Montemagni, and Claudia Soria. 2005. Automatic semantics extraction in law documents. In Proceedings of the 10th international conference on Artificial intelligence and law. 133–140.
[7] SRK Branavan, David Silver, and Regina Barzilay. 2012. Learning to win by reading manuals in a monte-carlo framework. Journal of Artificial Intelligence Research 43 (2012), 661–704.
[8] Stefanie Bruninghaus and Kevin D Ashley. 2003. Predicting outcomes of case based legal arguments. In Proceedings of the 9th international conference on Artificial intelligence and law. 233–242.
[9] Hector Neri Castañeda. 1967. Comment on D. Davidson's "The logical forms of action sentences". The Logic of Decision and Action (1967).
[10] Ilias Chalkidis and Ion Androutsopoulos. 2017. A Deep Learning Approach to Contract Element Extraction. In JURIX. 155–164.
[11] Ilias Chalkidis, Ion Androutsopoulos, and Achilleas Michos. 2018. Obligation and prohibition extraction using hierarchical RNNs. arXiv preprint arXiv:1805.03871 (2018).
[12] Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, and Ion Androutsopoulos. 2019. Large-Scale Multi-Label Text Classification on EU Legislation. arXiv preprint arXiv:1906.02192 (2019).
[13] Debajyoti Paul Chowdhury, Arghya Biswas, Tomasz Sosnowski, and Kristina Yordanova. 2020. Towards Evaluating Plan Generation Approaches with Instructional Texts. arXiv preprint arXiv:2001.04186 (2020).
[14] Peter Clark, Oren Etzioni, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, Sumithra Bhakthavatsalam, et al. 2019. From 'F' to 'A' on the NY Regents Science Exams: An Overview of the Aristo Project. arXiv preprint arXiv:1909.01958 (2019).
[15] Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Johan Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, et al. 1996. Using the framework. Technical Report LRE 62-051 D-16, The FraCaS Consortium.
[16] Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop. Springer, 177–190.
[17] Donald Davidson. 1967. The logical forms of action sentences. The Logic of Decision and Action (1967).
[18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[28] R Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.
[29] Herbert Lionel Adolphus Hart. 2012. The concept of law. Oxford University Press.
[30] Robert Hellawell. 1980. A computer program for legal planning and analysis: Taxation of stock redemptions. Columbia Law Review 80, 7 (1980), 1363–1398.
[31] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
[32] Imran Khan, Muhammad Sher, Javed I Khan, Syed M Saqlain, Anwar Ghani, Husnain A Naqvi, and Muhammad Usman Ashraf. 2016. Conversion of legal text to a logical rules set from medical law using the medical relational model and the world rule model for a medical decision support system. In Informatics, Vol. 3. Multidisciplinary Digital Publishing Institute, 2.
[33] Mi-Young Kim, Juliano Rabelo, and Randy Goebel. 2019. Statute Law Information Retrieval and Entailment. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law. 283–289.
[34] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[35] Anastassia Kornilova and Vladimir Eidelman. 2019. BillSum: A Corpus for Automatic Summarization of US Legislation. In Proceedings of the 2nd Workshop on New Frontiers in Summarization. Association for Computational Linguistics, Hong Kong, China, 48–56. https://doi.org/10.18653/v1/D19-5406
[36] John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. (2001).
[37] Robert S Ledley and Lee B Lusted. 1959. Reasoning foundations of medical diagnosis. Science 130, 3366 (1959), 9–21.
[38] L Thorne McCarty. 1976. Reflections on TAXMAN: An experiment in artificial intelligence and legal reasoning. Harv. L. Rev. 90 (1976), 837.
[39] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
[40] Randolph A Miller, Harry E Pople Jr, and Jack D Myers. 1982. Internist-I, an experimental computer-based diagnostic consultant for general internal medicine. New England Journal of Medicine 307, 8 (1982), 468–476.
[41] Tom Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, Bishan Yang, Justin Betteridge, Andrew Carlson, Bhanava Dalvi, Matt Gardner, Bryan Kisiel, et al. 2018. Never-ending learning. Commun. ACM 61, 5 (2018), 103–115.
[42] Terence Parsons. 1990. Events in the Semantics of English. Vol. 334. MIT Press, Cambridge, MA.
[43] Walter G Popp and Bernhard Schlink. 1974. Judith, a computer program to advise lawyers in reasoning a case. Jurimetrics J. 15 (1974), 303.
[44] Juliano Rabelo, Mi-Young Kim, and Randy Goebel. 2019. Combining Similarity and Transformer Methods for Case Law Entailment. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law. 290–296.
[45] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).
[19] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh,              [46] Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don’t Know:
     and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring                    Unanswerable Questions for SQuAD. In Proceedings of the 55th Annual Meeting
     discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161 (2019).                  of the Association for Computational Linguistics.
[20] Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S Weld. 2008. Open           [47] Abhilasha Ravichander, Alan W Black, Shomir Wilson, Thomas Norton, and
     information extraction from the web. Commun. ACM 51, 12 (2008), 68–74.                       Norman Sadeh. 2019. Question Answering for Privacy Policies: Combining
[21] Edward A Feigenbaum. 1992. Expert systems: principles and practice. (1992).                  Computational and Legal Perspectives. arXiv preprint arXiv:1911.00841 (2019).
[22] David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek,              [48] Edwina L Rissland, Kevin D Ashley, and Ronald Prescott Loui. 2003. AI and Law:
     Aditya A Kalyanpur, Adam Lally, J William Murdock, Eric Nyberg, John Prager,                 A fruitful synergy. Artificial Intelligence 150, 1-2 (2003), 1–15.
     et al. 2010. Building Watson: An overview of the DeepQA project. AI magazine            [49] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice
     31, 3 (2010), 59–79.                                                                         of plausible alternatives: An evaluation of commonsense causal reasoning. In
[23] Noah S Friedland, Paul G Allen, Gavin Matthews, Michael Witbrock, David Baxter,              2011 AAAI Spring Symposium Series.
     Jon Curtis, Blake Shepard, Pierluigi Miraglia, Jurgen Angele, Steffen Staab, et al.     [50] Marzieh Saeidi, Max Bartolo, Patrick Lewis, Sameer Singh, Tim Rocktäschel,
     2004. Project halo: Towards a digital aristotle. AI magazine 25, 4 (2004), 29–29.            Mike Sheldon, Guillaume Bouchard, and Sebastian Riedel. 2018. Interpretation
[24] Wachara Fungwacharakorn and Ken Satoh. 2018. Legal Debugging in Proposi-                     of natural language rules in conversational machine reading. arXiv preprint
     tional Legal Representation. In JSAI International Symposium on Artificial Intelli-          arXiv:1809.01494 (2018).
     gence. Springer, 146–159.                                                               [51] Ken Satoh, Kento Asai, Takamune Kogawa, Masahiro Kubota, Megumi Naka-
[25] Bryan A Gardner. 2019. Black’s Law Dictionary (11 ed.).                                      mura, Yoshiaki Nishigai, Kei Shirakawa, and Chiaki Takano. 2010. PROLEG: an
[26] Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The                   implementation of the presupposed ultimate fact theory of Japanese civil code by
     third pascal recognizing textual entailment challenge. In Proceedings of the ACL-            PROLOG technology. In JSAI International Symposium on Artificial Intelligence.
     PASCAL workshop on textual entailment and paraphrasing. Association for Com-                 Springer, 153–164.
     putational Linguistics, 1–9.                                                            [52] Marek J. Sergot, Fariba Sadri, Robert A. Kowalski, Frank Kriwaczek, Peter Ham-
[27] David Gunning, Vinay K Chaudhri, Peter E Clark, Ken Barker, Shaw-Yi Chaw,                    mond, and H Terese Cory. 1986. The British Nationality Act as a logic program.
     Mark Greaves, Benjamin Grosof, Alice Leung, David D McDonald, Sunil Mishra,                  Commun. ACM 29, 5 (1986), 370–386.
     et al. 2010. Project Halo Update—Progress Toward Digital Aristotle. AI Magazine         [53] David M Sherman. 1987. A Prolog model of the income tax act of Canada. In
     31, 3 (2010), 33–58.                                                                         Proceedings of the 1st international conference on Artificial intelligence and law.
                                                                                                  127–136.
[54] Edward H Shortliffe and Bruce G Buchanan. 1975. A model of inexact reasoning in medicine. Mathematical Biosciences 23, 3-4 (1975), 351–379.
[55] Jerrold Soh, How Khang Lim, and Ian Ernst Chai. 2019. Legal Area Classification: A Comparative Study of Text Classifiers on Singapore Supreme Court Judgments. Association for Computational Linguistics, Minneapolis, Minnesota.
[56] Anne vdL Gardner. 1983. The design of a legal analysis program. In AAAI-83. 114–118.
[57] Lai Dac Viet, Vu Trong Sinh, Nguyen Le Minh, and Ken Satoh. 2017. ConvAMR: Abstract meaning representation parsing for legal document. arXiv preprint arXiv:1711.06141 (2017).
[58] Thiemo Wambsganß and Hansjörg Fromm. 2019. Mining User-Generated Repair Instructions from Automotive Web Communities. In Proceedings of the 52nd Hawaii International Conference on System Sciences.
[59] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems. 3261–3275.
[60] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698 (2015).
[61] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 2369–2380. https://doi.org/10.18653/v1/D18-1259
[62] Masaharu Yoshioka, Yoshinobu Kano, Naoki Kiyota, and Ken Satoh. 2018. Overview of Japanese statute law retrieval and entailment task at COLIEE-2018. In Twelfth International Workshop on Juris-informatics (JURISIN 2018).
[63] Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. arXiv preprint arXiv:1808.05326 (2018).
[64] Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. 2019. Broad-Coverage Semantic Parsing as Transduction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3786–3798. https://doi.org/10.18653/v1/D19-1392
[65] Ziqi Zhang, Philip Webster, Victoria S Uren, Andrea Varga, and Fabio Ciravegna. 2012. Automatically Extracting Procedural Knowledge from Instructional Texts using Natural Language Processing. In LREC, Vol. 2012. 520–527.
[66] Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2019. JEC-QA: A Legal-Domain Question Answering Dataset. arXiv preprint arXiv:1911.12011 (2019).