<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nils Holzenberger</string-name>
          <email>nilsh@jhu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew Blair-Stanek</string-name>
          <email>ablair-stanek@law.umaryland.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Van Durme</string-name>
          <email>vandurme@cs.jhu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Johns Hopkins University</institution>
          ,
          <addr-line>Baltimore, Maryland</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>U. of Maryland Carey School of Law</institution>
          ,
          <addr-line>Baltimore, Maryland</addr-line>
          ,
          <country country="US">USA</country>
          ,
          <institution>Johns Hopkins University</institution>
          ,
          <addr-line>Baltimore, Maryland</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>Legislation can be viewed as a body of prescriptive rules expressed in natural language. The application of legislation to facts of a case we refer to as statutory reasoning, where those facts are also expressed in natural language. Computational statutory reasoning is distinct from most existing work in machine reading, in that much of the information needed for deciding a case is declared exactly once (a law), while the information needed in much of machine reading tends to be learned through distributional language statistics. To investigate the performance of natural language understanding approaches on statutory reasoning, we introduce a dataset, together with a legal-domain text corpus. Straightforward application of machine reading models exhibits low out-of-the-box performance on our questions, whether or not they have been fine-tuned to the legal domain. We contrast this with a hand-constructed Prolog-based system, designed to fully solve the task. These experiments support a discussion of the challenges facing statutory reasoning moving forward, which we argue is an interesting real-world task that can motivate the development of models able to utilize prescriptive rules specified in natural language.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Applied computing → Law; • Computing methodologies → Natural language processing; Knowledge representation and reasoning.</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>
        Early artificial intelligence research focused on highly-performant,
narrow-domain reasoning models, for instance in health [
        <xref ref-type="bibr" rid="ref37 ref40 ref54">37, 40,
54</xref>
        ] and law [
        <xref ref-type="bibr" rid="ref30 ref38">30, 38</xref>
        ]. Such expert systems relied on hand-crafted
inference rules and domain knowledge, expressed and stored with
the formalisms provided by databases [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. The main bottleneck of
this approach is that experts are slow in building such knowledge
bases and exhibit imperfect recall, which motivated research into
models for automatic information extraction (e.g. Lafferty et al. [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]).
Systems for large-scale automatic knowledge base construction
have improved (e.g. Etzioni et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], Mitchell et al. [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ]), as well
as systems for sentence level semantic parsing [
        <xref ref-type="bibr" rid="ref64">64</xref>
        ]. Among others,
this effort has led to question-answering systems for games [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]
and, more recently, for science exams [
        <xref ref-type="bibr" rid="ref14 ref23 ref27">14, 23, 27</xref>
        ]. The challenges
include extracting ungrounded knowledge from semi-structured
sources, e.g. textbooks, and connecting high-performance symbolic
solvers with large-scale language models.
      </p>
      <p>
        In parallel, models have begun to consider task definitions like
Machine Reading (MR) [
        <xref ref-type="bibr" rid="ref46">46</xref>
        ] and Recognizing Textual Entailment
(RTE) [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ] as not requiring the use of explicit structure. Instead,
the problem is cast as one of mapping inputs to high-dimensional,
dense representations that implicitly encode meaning [
        <xref ref-type="bibr" rid="ref18 ref45">18, 45</xref>
        ], and
are employed in building classifiers or text decoders, bypassing
classic approaches to symbolic inference.
      </p>
      <p>
        This work is concerned with the problem of statutory reasoning
[
        <xref ref-type="bibr" rid="ref62 ref66">62, 66</xref>
        ]: how to reason about an example situation, a case, based
on complex rules provided in natural language. In addition to the
reasoning aspect, we are motivated by the lack of contemporary
systems to suggest legal opinions: while there exist tools to aid
lawyers in retrieving relevant documents for a given case, we are
unaware of any strong capabilities in automatic statutory reasoning.
      </p>
      <p>
        Our contributions, summarized in Figure 2, include a novel
dataset based on US tax law, together with test cases (Section 2).
Decades-old work in expert systems could solve problems of the
sort we construct here, based on manually derived rules: we
replicate that approach in a Prolog-based system that achieves 100%
accuracy on our examples (Section 3). Our results demonstrate
that straightforward application of contemporary Machine
Reading models is not sufficient for our challenge examples (Section
5), whether or not they were adapted to the legal domain (Section
4). This is meant to provoke the question of whether we should
be concerned with: (a) improving methods in semantic parsing
in order to replace manual transduction into symbolic form; or
(b) improving machine reading methods in order to avoid explicit
symbolic solvers. We view this work as part of the conversation
including recent work in multi-hop inference [
        <xref ref-type="bibr" rid="ref61">61</xref>
        ], where our task
is more domain-specific but potentially more challenging.
      </p>
      <p>
Here, we describe our main contribution, the StAtutory
Reasoning Assessment dataset (SARA): a set of rules extracted from the
statutes of the US Internal Revenue Code (IRC), together with a
set of natural language questions which may only be answered
correctly by referring to the rules<sup>1</sup>.
      </p>
      <p>The IRC<sup>2</sup> contains rules and definitions for the imposition and
calculation of taxes. It is subdivided into sections, which, in general,
define one or more terms: section 3306 defines the terms
employment, employer and wages, for purposes of the federal
unemployment tax. Sections are typically structured around a general rule,
followed by a number of exceptions. Each section and its
subsections may be cast as a predicate whose truth value can be checked
against a state of the world. For instance, subsection 7703(a)(2):
an individual legally separated from his spouse under
a decree of divorce or of separate maintenance shall not
be considered as married
can be checked given an individual.</p>
      <p><sup>1</sup> The dataset can be found under https://nlp.jhu.edu/law/</p>
      <p><sup>2</sup> https://uscode.house.gov/browse/prelim@title26&amp;edition=prelim</p>
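      <p>As a rough illustration of casting a subsection as a predicate over a state of the world, the following Python sketch checks §7703(a)(2) for one individual. The Person record and its separation_decree field are hypothetical names introduced only for this example; they are not part of the dataset.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    """Toy state of the world for one individual (hypothetical schema)."""
    name: str
    spouse: Optional[str] = None
    # Kind of decree under which the individual is legally separated, if any.
    separation_decree: Optional[str] = None


def s7703_a_2(individual: Person) -> bool:
    """True if section 7703(a)(2) applies: the individual is legally separated
    from their spouse under a decree of divorce or of separate maintenance,
    and is therefore not considered as married."""
    return (
        individual.spouse is not None
        and individual.separation_decree in {"divorce", "separate maintenance"}
    )


alice = Person("Alice", spouse="Bob", separation_decree="divorce")
carol = Person("Carol", spouse="Dan")
print(s7703_a_2(alice), s7703_a_2(carol))
```
      </preformat>
      <p>The predicate takes the individual as its only argument; the spouse and the decree are read from the state of the world rather than passed in explicitly.</p>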
      <p>Slots are another major feature of the law. Each subsection refers
to a certain number of slots, which may be filled by existing entities
(in the above, individual, spouse, and decree of divorce or of separate
maintenance). Certain slots are implicitly filled: §7703(a)(1) and
(b)(3) mention a “spouse”, which must exist since the “individual” is
married. Similarly, slots which have been filled earlier in the section
may be referred to later on. For instance, “household” is mentioned
for the first time in §7703(b)(1), then again in §7703(b)(2) and in
§7703(b)(3). Correctly resolving slots is a key point in successfully
applying the law.</p>
      <p>
        Overall, the IRC can be framed as a set of predicates formulated in
human language. The language used to express the law has an open
texture [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], which makes it particularly challenging for a
computer-based system to determine whether a subsection applies, and to
identify and fill the slots mentioned. This makes the IRC an excellent
corpus to build systems that reason with rules specified in natural
language, and have good language understanding capabilities.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Statutes and test cases</title>
      <p>As the basis of our set of rules, we selected sections of the IRC
well-supported by Treasury Regulations, covering tax on
individuals (§1), marriage and other legal statuses (§2, 7703), dependents
(§152), tax exemptions and deductions (§63, 68, 151) and
employment (§3301, 3306). We simplified the sections to (1) remove highly
specific sections (e.g. those concerning the employment of sailors)
in order to keep the statutes to a manageable size, and (2) ensure
that the sections only refer to sections from the selected subset. For
ease of comparison with the original statutes, we kept the original
numbering and lettering, with no adjustment for removed sections.
For example, there is a section 63(d) and a section 63(f), but no
section 63(e). We assumed that any taxable year starts and ends at
the same time as the corresponding calendar year.</p>
      <p>For each subsection extracted from the statutes, we manually
created two paragraphs in natural language describing a case, one
where the statute applies, and one where it does not. These snippets,
formulated as a logical entailment task, are meant to test a system’s
understanding of the statutes, as illustrated in Figure 1. The cases
were vetted by a law professor for coherence and plausibility. For
the purposes of machine learning, the cases were split into 176
train and 100 test samples, such that (1) each pair of positive and
negative cases belongs to the same split, and (2) each section is split
between train and test in the same proportions as the overall split.</p>
      <p>Since tax legislation makes it possible to predict how much tax
a person owes, we created an additional set of 100 cases where the
task is to predict how much tax someone owes. Those cases were
created by randomly mixing and matching pairs of cases from the
first set of cases, and resolving inconsistencies manually. Those
cases are no longer a binary prediction task, but a task of predicting
an integer. The prediction results from taking into account the
entirety of the statutes, and involves basic arithmetic. The 100 cases
were randomly split into 80 training and 20 test samples.</p>
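      <p>Because the statutes were simplified, the answers to the cases
are not those that would be obtained with the current version of
the IRC. Some of the IRC counterparts of the statutes in our dataset
have been repealed, amended, or adjusted to reflect inflation.</p>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Number of subsections containing cross-references</p>
        </caption>
        <table>
          <thead>
            <tr><th>Cross-references</th><th>Explicit</th><th>Implicit</th></tr>
          </thead>
          <tbody>
            <tr><td>Within the section</td><td>30</td><td>25</td></tr>
            <tr><td>To another section</td><td>34</td><td>44</td></tr>
          </tbody>
        </table>
      </table-wrap>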
    </sec>
    <sec id="sec-4">
      <title>Key features of the corpus</title>
      <p>While the corpus is based on a simplification of the Internal
Revenue Code, care was taken to retain prominent features of US law.
We note that the present task is only one aspect of legal
reasoning, which in general involves many more modes of reasoning, in
particular interpreting regulations and prior judicial decisions. The
following features are quantified in Tables 1 to 4.</p>
      <p>Reasoning with time. The timing of events (marriage, retirement,
income...) is highly relevant to determining whether certain sections
apply, as tax is paid yearly. In total, 62 sections refer to time. Some
sections require counting days, as in §7703(b)(1):
a household which constitutes for more than one-half of
the taxable year the principal place of abode of a child
or taking into account the absolute point in time as in §63(c)(7):</p>
      <sec id="sec-4-1">
        <title>In the case of a taxable year beginning after December 31, 2017, and before January 1, 2026</title>
        <p>Exceptions and substitutions. Typically, each section of the IRC
starts by defining a general case and then enumerates a number of
exceptions to the rule. Additionally, some rules involve applying
a rule after substituting terms. A total of 50 sections formulate an
exception or a substitution. As an example, §63(f)(3):</p>
      </sec>
      <sec id="sec-4-2">
        <title>In the case of an individual who is not married and is not a surviving spouse, paragraphs (1) and (2) shall be applied by substituting “$750” for “$600”.</title>
        <p>Numerical reasoning. Computing tax owed requires knowledge of
the basic arithmetic operations of adding, subtracting, multiplying,
dividing, rounding and comparing numbers. 55 sections involve
numerical reasoning. The operation to be used needs to be parsed
out of natural text, as in §1(c)(2):
$3,315, plus 28% of the excess over $22,100 if the taxable
income is over $22,100 but not over $53,500</p>
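        <p>To make the arithmetic concrete, the sketch below applies only the bracket quoted above from §1(c)(2). The function name is hypothetical, and rounding to the nearest dollar (half up) is an assumption made for this example, not a rule stated in the quoted text.</p>
        <preformat>
```python
def tax_1_c_2_bracket(taxable_income: int) -> int:
    """Tax under the single bracket quoted from section 1(c)(2):
    $3,315, plus 28% of the excess over $22,100, for taxable income
    over $22,100 but not over $53,500."""
    lower, upper = 22_100, 53_500
    # Comparing: the bracket only covers incomes in (lower, upper].
    if not (taxable_income > lower and upper >= taxable_income):
        raise ValueError("taxable income falls outside the quoted bracket")
    # Subtracting: the part of the income above the lower bound.
    excess = taxable_income - lower
    # Multiplying and rounding: 28% of the excess, rounded half up to the
    # nearest dollar with integer arithmetic (rounding rule is an assumption).
    return 3_315 + (28 * excess + 50) // 100


print(tax_1_c_2_bracket(30_000))
```
        </preformat>
        <p>For a taxable income of $30,000, the excess is $7,900, 28% of which is $2,212, giving $5,527 of tax under this bracket.</p>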
        <p>Cross-references. Each section of the IRC will typically reference
other sections. Table 1 shows how this feature was preserved in
our dataset. There are explicit references within the same section,
as in §7703(b)(1):
an individual who is married (within the meaning of
subsection (a)) and who files a separate return
explicit references to another section, as in §3301:</p>
      </sec>
      <sec id="sec-4-3">
        <title>There is hereby imposed on every employer (as defined</title>
        <p>in section 3306(a)) for each calendar year an excise tax
and implicit references, as in §151(a), where “taxable income” is
defined in §63:
the exemptions provided by this section shall be allowed
as deductions in computing taxable income.</p>
        <p>
          Common sense knowledge. Four concepts, other than time, are left
undefined in our statutes: (1) kinship, (2) the fact that a marriage
ends if either spouse dies, (3) if an event has not ended, then it
is ongoing; if an event has no start, it has been true at any time
before it ends; and some events are instantaneous (e.g. payments).
        </p>
        <p>
It has been shown that subsets of statutes can be expressed in
first-order logic, as described in Section 6. As a reaffirmation of this, and
as a topline for our task, we have manually translated the statutes
into Prolog rules and the cases into Prolog facts, such that each case
can be answered correctly by a single Prolog query<sup>3</sup>. The Prolog
rules were developed based on the statutes, meaning that the Prolog
code clearly reflects the semantics of the textual form, as in Gunning
et al. [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. This is primarily meant as a proof that a carefully crafted
reasoning engine, with perfect natural language understanding, can
solve this dataset. There certainly are other ways of representing
this given set of statutes and cases. The point of this dataset is not
to design a better Prolog system, but to help the development of
language understanding models capable of reasoning.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Statutes</title>
      <p>Each subsection of the statutes was translated with a single rule,
true if the section applies, false otherwise. In addition, subsections
define slots that may be filled and reused in other subsections, as
described in Section 2. To solve this coreference problem, any term
appearing in a subsection and relevant across subsections is turned
into an argument of the Prolog rule. The corresponding variable
may then be bound during the execution of a rule, and reused in a
rule executed later. Unfilled slots correspond to unbound variables.</p>
      <p><sup>3</sup> The Prolog program can be found under https://nlp.jhu.edu/law/</p>
      <p>
        To check whether a given subsection applies, the Prolog
system needs to rely on certain predicates, which directly reflect the
facts contained in the natural language descriptions of the cases.
For instance, how do we translate “Alice and Bob got married on
January 24th, 1993” into code usable by Prolog? We rely on a set
of 61 predicates, following neo-Davidsonian semantics [
        <xref ref-type="bibr" rid="ref17 ref42 ref9">9, 17, 42</xref>
        ].
The level of detail of these predicates is based on the granularity
of the statutes themselves. Anything the statutes do not define,
and which is typically expressed with a single word, is potentially
such a predicate: marriage, residing somewhere, someone paying
someone else, etc. The example above is translated in Figure 3.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Cases</title>
      <p>The natural language description of each case was manually translated
into the facts mentioned above. The question or logical entailment prompt
was translated into a Prolog query. For instance, “Section 7703(b)(3)
applies to Alice maintaining her home for the year 2018.” translates
to <monospace>s7703_b_3(alice,home,2018).</monospace> and “How much tax does Alice
have to pay in 2017?” translates to <monospace>tax(alice,2017,Amount).</monospace></p>
      <fig>
        <label>Figure 3</label>
        <caption>
          <p>Example predicates used.</p>
        </caption>
        <preformat>marriage_(alice_and_bob).
agent_(alice_and_bob, alice).
agent_(alice_and_bob, bob).
start_(alice_and_bob, "1993-01-24").</preformat>
      </fig>
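      <p>The event-style facts of Figure 3 can also be queried outside Prolog. The Python sketch below stores the same four facts as tuples and answers a simple question about them. The query function and the end_ predicate it looks for are hypothetical and only serve this illustration; they are not taken from the Prolog program.</p>
      <preformat>
```python
from datetime import date

# Facts mirroring Figure 3, stored as (predicate, arguments) tuples.
FACTS = {
    ("marriage_", ("alice_and_bob",)),
    ("agent_", ("alice_and_bob", "alice")),
    ("agent_", ("alice_and_bob", "bob")),
    ("start_", ("alice_and_bob", "1993-01-24")),
}


def _dates(predicate: str, event: str):
    """All dates attached to `event` by the given predicate."""
    return [date.fromisoformat(args[1]) for name, args in FACTS
            if name == predicate and args[0] == event]


def is_married_on(person: str, day: date) -> bool:
    """Hypothetical query: is `person` an agent of a marriage event that has
    started by `day` and has not ended? An event without a recorded end is
    treated as ongoing; an event without a recorded start as already true."""
    for name, args in FACTS:
        if name != "marriage_":
            continue
        event = args[0]
        if ("agent_", (event, person)) not in FACTS:
            continue
        started = all(day >= start for start in _dates("start_", event))
        not_ended = all(end > day for end in _dates("end_", event))
        if started and not_ended:
            return True
    return False


print(is_married_on("alice", date(2018, 1, 1)))
```
      </preformat>
      <p>Keeping the event identifier as the first argument of every fact is what lets agents and dates be attached to the same marriage independently of each other.</p>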
      <p>In the broader context of computational statutory reasoning,
the Prolog solver has three limitations. First, producing it requires
domain experts, while automatic generation is an open question.
Second, translating natural language into facts requires semantic
parsing capabilities. Third, small mistakes can lead to catastrophic
failure. An orthogonal approach is to replace logical operators and
explicit structure with high-dimensional, dense representations and
real-valued functions, both learned using distributional statistics.
Such a machine learning-based approach can be adapted to new
legislation and new domains automatically.</p>
    </sec>
    <sec id="sec-7">
      <title>LEGAL NLP</title>
      <p>As is commonly done in MR, we pretrained our models using two
unsupervised learning paradigms on a large corpus of legal text.</p>
    </sec>
    <sec id="sec-8">
      <title>Text corpus</title>
      <p>
        We curated a corpus consisting solely of freely-available tax law
documents with 147M tokens. The first half is drawn from case.law [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
a project of Harvard’s Law Library that scanned and OCR’ed many
of the library’s case-law reporters, making the text available upon
request to researchers. The main challenge in using this resource is
that it contains 1.7M U.S. federal cases, only a small percentage of
which are on tax law (as opposed to criminal law, breach of contract,
bankruptcy, etc.). Classifying cases by area is a non-trivial problem
[
        <xref ref-type="bibr" rid="ref55">55</xref>
        ], and tax-law cases are litigated in many different courts. We
used the heuristic of classifying a case as being tax-law if it met one
of the following criteria: the Commissioner of Internal Revenue
was a party; the case was decided by the U.S. Tax Court; or, the case
was decided by any other federal court, other than a trade tribunal,
with the United States as a party, and with the word tax appearing
in the first 400 words of the case’s written opinion.
      </p>
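      <p>The three-way heuristic above can be sketched as a small filter. The Case record, its field names, and the exact string matching are assumptions made for this illustration; the metadata actually available in the case-law corpus may be organized differently.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import List


@dataclass
class Case:
    """Hypothetical minimal record for one federal case."""
    parties: List[str]
    court: str
    opinion: str
    is_trade_tribunal: bool = False


def is_tax_case(case: Case) -> bool:
    """Return True if the case meets any one of the three criteria."""
    parties = [p.lower() for p in case.parties]
    # Criterion 1: the Commissioner of Internal Revenue was a party.
    if any("commissioner of internal revenue" in p for p in parties):
        return True
    # Criterion 2: the case was decided by the U.S. Tax Court.
    if case.court.lower() == "u.s. tax court":
        return True
    # Criterion 3: another federal court that is not a trade tribunal, with
    # the United States as a party and the word "tax" among the first 400
    # words of the opinion (exact word match is an assumption).
    first_words = case.opinion.lower().split()[:400]
    mentions_tax = any(w.strip(".,;:()") == "tax" for w in first_words)
    us_is_party = "united states" in parties
    return us_is_party and mentions_tax and not case.is_trade_tribunal


example = Case(["Doe", "United States"], "U.S. District Court",
               "The tax was assessed in 1999.")
print(is_tax_case(example))
```
      </preformat>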
      <p>The second half of this corpus consists of IRS private letter
rulings and unpublished U.S. Tax Court cases. IRS private letter rulings
are similar to cases, in that they apply tax law to one taxpayer’s
facts; they differ from cases in that they are written by IRS attorneys
(not judges), have less precedential authority than cases, and redact
names to protect taxpayer privacy. Unpublished U.S. Tax Court
cases are viewed by the judges writing them as less important than
those worthy of publication. These were downloaded as PDFs from
the IRS and Tax Court websites, OCR’ed with tesseract if needed,
and otherwise cleaned.</p>
    </sec>
    <sec id="sec-9">
      <title>Tax vectors</title>
      <p>
        Before training a word2vec model [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ] on this corpus, we performed two
tax-specific preprocessing steps to ensure that semantic units
remained together. First, we put underscores between multi-token
collocations that are tax terms of art, defined in either the tax code,
Treasury regulations, or a leading tax-law dictionary. Thus,
“surviving spouse” became the single token “surviving_spouse”. Second,
we turned all tax code sections and Treasury regulations into a
single token, stripped of references to subsections, subparagraphs,
and subclauses. Thus, “Treas. Reg. §1.162-21(b)(1)(iv)” became the
single token “sec_1_162_21”. The vectors were trained at 500
dimensions using skip-gram with negative sampling. A window size of 15
was found to maximize performance on twelve human-constructed
analogy tasks.
      </p>
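      <p>A minimal sketch of the two preprocessing steps, reproducing the two examples given above. The regular expression and the two-entry list of terms of art are assumptions made for illustration; the actual term list is drawn from the tax code, Treasury regulations and a tax-law dictionary, and the actual citation patterns are likely more varied.</p>
      <preformat>
```python
import re

# Hypothetical, tiny list of multi-token tax terms of art.
TERMS_OF_ART = ["surviving spouse", "taxable income"]

# Matches citations such as "Treas. Reg. §1.162-21(b)(1)(iv)" or "§7703(b)(1)":
# an optional "Treas. Reg." prefix, the section number (captured), and any
# trailing parenthesized subdivisions (matched but not captured).
SECTION_RE = re.compile(
    r"(?:Treas\.\s*Reg\.\s*)?§\s*(\d+(?:[.-]\d+)*)(?:\([0-9A-Za-z]+\))*"
)


def normalize(text: str) -> str:
    """Turn section citations and terms of art into single tokens."""
    def section_token(match):
        # Keep only the section number, with "." and "-" turned into "_".
        return "sec_" + re.sub(r"[.-]", "_", match.group(1))

    text = SECTION_RE.sub(section_token, text)
    for term in TERMS_OF_ART:
        # Join the words of each term of art with underscores.
        text = re.sub(re.escape(term), term.replace(" ", "_"), text,
                      flags=re.IGNORECASE)
    return text


print(normalize("Treas. Reg. §1.162-21(b)(1)(iv)"))
print(normalize("the surviving spouse under §7703(b)(1)"))
```
      </preformat>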
    </sec>
    <sec id="sec-10">
      <title>Legal BERT</title>
      <p>
        We performed further training of BERT [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] on a portion of the full
case.law corpus, including both state and federal cases. We did
not limit the training to tax cases. Rather, the only cases excluded
were those under 400 characters (which tend to be summary orders
with little semantic content) and those before 1970 (when judicial
writing styles had become recognizably modern). We randomly
selected a subset of the remaining cases, and broke all selected
cases into chunks of exactly 510 tokens, which is the most BERT’s
architecture can handle. Any remaining tokens in a selected case
were discarded. Using solely the masked language model task (i.e.
not next sentence prediction), starting from Bert-Base-Cased, we
trained on 900M tokens.
      </p>
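      <p>The chunking step can be sketched as follows, assuming a case has already been tokenized into a list. The function name is hypothetical; the comment about special tokens reflects the usual 512-position limit of BERT and is our reading of why 510 is the maximum.</p>
      <preformat>
```python
from typing import List

# 510 content tokens per chunk; presumably 512 positions minus [CLS] and [SEP].
CHUNK_SIZE = 510


def chunk_case(tokens: List[str], chunk_size: int = CHUNK_SIZE) -> List[List[str]]:
    """Split a tokenized case into chunks of exactly `chunk_size` tokens.
    Tokens left over after the last full chunk are discarded."""
    n_full = len(tokens) // chunk_size
    return [tokens[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]


chunks = chunk_case(["tok"] * 1200)
print(len(chunks), [len(c) for c in chunks])
```
      </preformat>
      <p>A case of 1,200 tokens yields two chunks of 510 tokens, and the remaining 180 tokens are dropped; a case shorter than 510 tokens yields no chunk at all.</p>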
      <p>The resulting Legal BERT has the exact same architecture as
Bert-Base-Cased but parameters better attuned to legal tasks. We
applied both models to the natural language questions and answers
in the corpus we introduce in this paper. While Bert-Base-Cased
had a perplexity of 14.4, Legal BERT had a perplexity of just 2.7,
suggesting that the further training on 900M tokens made the model
much better adapted to legal queries.</p>
      <p>
        We also probed how this further training impacted ability to
handle fine-tuning on downstream tasks. The downstream task we
chose was identifying legal terms in case texts. For this task, we
defined legal terms as any tokens or multi-token collocations that
are defined in Black’s Law Dictionary [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], the premier legal
dictionary. We split the legal terms into training/dev/test splits. We put
a 4-layer fully-connected MLP on top of both Bert-Base-Cased
and Legal BERT, where the training objective was B-I-O tagging
of tokens in 510-token sequences. We trained both on a set of
200M tokens randomly selected from case.law cases not
previously seen by the model and not containing any of the legal terms
in dev or test, with the training legal terms tagged using string
comparisons. We then tested both fine-tuned models’ ability to
identify legal terms from the test split in case law. The model
based on Bert-Base-Cased achieved F1 = 0.35, whereas Legal BERT
achieved F1 = 0.44. As a baseline, two trained lawyers given the same
task on three 510-token sequences each achieved F1 = 0.26. These
results indicate that Legal BERT is much better adapted to the legal
domain than Bert-Base-Cased. Black’s Law Dictionary has
well-developed standards for what terms are or are not included. BERT
models learn those standards via the train set, whereas lawyers
are not necessarily familiar with them. In addition, pre-processing
dropped some legal terms that were subsets of too many others,
which the lawyers tended to identify. This explains how
BERT-based models could outperform trained humans.
      </p>
    </sec>
    <sec id="sec-11">
      <title>EXPERIMENTS</title>
    </sec>
    <sec id="sec-12">
      <title>BERT-based models</title>
      <p>
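        In the following, we frame our task as textual entailment and
numerical regression. A given entailment prompt <italic>q</italic> mentions the relevant
subsection (as in Figure 1)<sup>4</sup>. We extract <italic>s</italic>, the text of the relevant
subsection, from the statutes. In <italic>q</italic>, we replace “Section XYZ applies”
with “This applies”. We feed the string “[CLS] + <italic>s</italic> + [SEP] + <italic>q</italic> + <italic>c</italic>
+ [SEP]”, where “+” is string concatenation and <italic>c</italic> is the context, to BERT [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Let <italic>r</italic>
be the vector representation of the token [CLS] in the final layer.
The answer (entailment or contradiction) is predicted as <inline-formula><tex-math>g(\theta_1 \cdot r)</tex-math></inline-formula>
where <inline-formula><tex-math>\theta_1</tex-math></inline-formula> is a learnable parameter and <italic>g</italic> is the sigmoid function.
For numerical questions, all statutes have to be taken into account,
which would exceed BERT’s length limit. We encode “[CLS] all
[SEP] + <italic>q</italic> + <italic>c</italic> + [SEP]” into <italic>r</italic> and predict the answer as <inline-formula><tex-math>\mu + \sigma\,\theta_2 \cdot r</tex-math></inline-formula>
where <inline-formula><tex-math>\theta_2</tex-math></inline-formula> is a learned parameter, and <inline-formula><tex-math>\mu</tex-math></inline-formula> and <inline-formula><tex-math>\sigma</tex-math></inline-formula> are the mean and
standard deviation of the numerical answers on the training set.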
      </p>
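      <p>For entailment, we use a cross-entropy loss, and evaluate the
models using accuracy. We frame the numerical questions as a
taxpayer having to compute tax owed. By analogy with the concept
of “substantial understatement of income tax” from §6662(d), we
define
        <disp-formula><tex-math>\Delta(y, \hat{y}) = \frac{|y - \hat{y}|}{\max(0.1\,y,\ 5000)}</tex-math></disp-formula>
where <inline-formula><tex-math>y</tex-math></inline-formula> is the true amount of tax
owed, and <inline-formula><tex-math>\hat{y}</tex-math></inline-formula> is the taxpayer’s prediction. The case <inline-formula><tex-math>\Delta(y, \hat{y}) \geq 1</tex-math></inline-formula>
corresponds to a substantial over- or understatement of tax. We
compute the fraction of predictions <inline-formula><tex-math>\hat{y}</tex-math></inline-formula> such that <inline-formula><tex-math>\Delta(y, \hat{y}) &lt; 1</tex-math></inline-formula> and
report that as numerical accuracy.<sup>5</sup> The loss function used is:
        <disp-formula><tex-math>\mathcal{L} = -\sum_{i \in I_1} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right] + \sum_{i \in I_2} \max(\Delta(y_i, \hat{y}_i) - 1,\ 0)</tex-math></disp-formula>
where <inline-formula><tex-math>I_1</tex-math></inline-formula> (resp. <inline-formula><tex-math>I_2</tex-math></inline-formula>) is the set of entailment (resp. numerical)
questions, <inline-formula><tex-math>y_i</tex-math></inline-formula> is the ground truth output, and <inline-formula><tex-math>\hat{y}_i</tex-math></inline-formula> is the model’s output.</p>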
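      <p>The numerical metric can be sketched in a few lines of Python. This assumes the reading Δ(y, ŷ) = |y − ŷ| / max(0.1·y, 5000), i.e. an error counts as substantial once it reaches the greater of 10% of the tax owed and $5,000; the function names are hypothetical.</p>
      <preformat>
```python
from typing import Sequence


def delta(y: float, y_hat: float) -> float:
    """Error relative to the greater of 10% of the true tax and $5,000."""
    return abs(y - y_hat) / max(0.1 * y, 5000.0)


def numerical_accuracy(truth: Sequence[float], preds: Sequence[float]) -> float:
    """Fraction of predictions whose delta is strictly below 1."""
    correct = sum(1 for y, y_hat in zip(truth, preds) if 1 > delta(y, y_hat))
    return correct / len(truth)


def hinge_term(y: float, y_hat: float) -> float:
    """Per-example loss term for numerical questions: max(delta - 1, 0)."""
    return max(delta(y, y_hat) - 1.0, 0.0)


# $9,000 off on $100,000 owed is within tolerance (threshold $10,000);
# $6,000 off on $20,000 owed is not (threshold $5,000).
print(numerical_accuracy([100_000, 20_000], [109_000, 26_000]))
```
      </preformat>
      <p>The hinge term is zero for any prediction counted as correct, so the loss only penalizes substantial over- or understatements.</p>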
      <p>
        We use Adam [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ] with a linear warmup schedule for the learning
rate. We freeze BERT’s parameters, and experiment with unfreezing
BERT’s top layer. We select the final model based on early stopping
with a random 10% of the training examples reserved as a dev
set. The best performing models for entailment and for numerical
questions are selected separately, during a hyperparameter search
around the recommended setting (batch size=32, learning
rate=1e-5). To check for bias in our dataset, we drop either the statute, or
the context and the statute, in which case we predict the answer
from BERT’s representation for “[CLS] + <italic>c</italic> + [SEP] + <italic>q</italic> + [SEP]” or
“[CLS] + <italic>q</italic> + [SEP]”, whichever is relevant.
      </p>
      <p><sup>4</sup> The code for these experiments can be found under https://github.com/SgfdDttt/sara</p>
      <p><sup>5</sup> For a company, a goal would be to have 100% accuracy (resulting in no tax penalties)
while paying the lowest amount of taxes possible (giving them something of an
interest-free loan, even if the IRS eventually collects the understated tax).</p>
      <p>
We follow Arora et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to embed strings into vectors, with
smoothing parameter equal to 10<sup>−3</sup>. We use either tax vectors
described in Section 4 or word2vec vectors [
        <xref ref-type="bibr" rid="ref39">39</xref>
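        ]. We estimate unigram
counts from the corpus used to build the tax vectors, or the
training set, whichever is relevant. For a given context <italic>c</italic> and question
or prompt <italic>q</italic>, we retrieve relevant subsection <italic>s</italic> as above. Using
Arora et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], <italic>s</italic> is mapped to vector <inline-formula><tex-math>v_s</tex-math></inline-formula>, and <inline-formula><tex-math>(c, q)</tex-math></inline-formula> to <inline-formula><tex-math>v_{c+q}</tex-math></inline-formula>. Let
<inline-formula><tex-math>r = [v_s,\ v_{c+q},\ |v_s - v_{c+q}|,\ v_s \odot v_{c+q}]</tex-math></inline-formula> where <inline-formula><tex-math>[a, b]</tex-math></inline-formula> is the
concatenation of <italic>a</italic> and <italic>b</italic>, |.| is the element-wise absolute value, and ⊙ is the
element-wise product. The answer is predicted as <inline-formula><tex-math>g(\theta_1 \cdot f(r))</tex-math></inline-formula> or
<inline-formula><tex-math>\mu + \sigma\,\theta_2 \cdot f(r)</tex-math></inline-formula>, as above, where <italic>f</italic> is a feed-forward neural network.
We use batch normalization between each layer of the neural
network [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. As above, we perform ablation experiments, where we
drop the statute, or the context and the statute, in which case <italic>r</italic>
is replaced by <inline-formula><tex-math>v_{c+q}</tex-math></inline-formula> or <inline-formula><tex-math>v_q</tex-math></inline-formula>. We also experiment with <italic>f</italic> being the
identity function (no neural network). Training is otherwise done
as above, but without the warmup schedule.
      </p>
      <p>
We report the accuracy on the test set (in %) in Table 5. In our
ablation experiments, “question” models have access to the question
only, “context” to the context and question, and “statute” to the
statutes, context and question. For entailment, we use a majority
baseline. For the numerical questions, we find the constant that
minimizes the hinge loss on the training set up to 2 digits: $11,023.
As a check, we swapped in the concatenation of the RTE datasets of
Bentivogli et al. [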
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Dagan et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], Giampiccolo et al. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], Haim
et al. [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], and achieved 73.6% accuracy on the dev set with BERT,
close to numbers reported in Wang et al. [
        <xref ref-type="bibr" rid="ref59">59</xref>
        ]. BERT was trained on
Wikipedia, which contains snippets of law text: see article United
States Code and links therefrom, especially Internal Revenue Code.
Overall, models perform comparably to the baseline, independent
of the underlying method. Performance remains mostly unchanged
when dropping the statutes or statutes and context, meaning that
models are not utilizing the statutes. Adapting BERT or word
vectors to the legal domain has no noticeable effect. Our results suggest
that performance will not be improved through straightforward
application of a large-scale language model, unlike on other
datasets: Raffel et al. [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ] achieved 94.8% accuracy on COPA [
        <xref ref-type="bibr" rid="ref49">49</xref>
        ]
using a large-scale multitask Transformer model, and BERT
provided a huge jump in performance on both SQuAD 2.0 [
        <xref ref-type="bibr" rid="ref46">46</xref>
        ] (+8.2
F1) and SWAG [
        <xref ref-type="bibr" rid="ref63">63</xref>
        ] (+27.1 percentage points accuracy) datasets as
compared to predecessor models, pre-trained on smaller datasets.
      </p>
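      <p>The input features of the word-vector models described in this section can be sketched without any library. Here v_s stands for the embedding of the subsection and v_cq for the embedding of the context and question; the sketch takes them as given, uses the identity function in place of the feed-forward network, and all function names are hypothetical.</p>
      <preformat>
```python
import math
from typing import List


def features(v_s: List[float], v_cq: List[float]) -> List[float]:
    """Concatenate both vectors, their element-wise absolute difference,
    and their element-wise product."""
    abs_diff = [abs(a - b) for a, b in zip(v_s, v_cq)]
    product = [a * b for a, b in zip(v_s, v_cq)]
    return v_s + v_cq + abs_diff + product


def predict_entailment(theta: List[float], r: List[float]) -> float:
    """Sigmoid of the dot product between a parameter vector and r,
    i.e. the case where the feed-forward network is the identity."""
    score = sum(t * x for t, x in zip(theta, r))
    return 1.0 / (1.0 + math.exp(-score))


r = features([1.0, 2.0], [3.0, -1.0])
print(r)
print(predict_entailment([0.0] * len(r), r))
```
      </preformat>
      <p>With d-dimensional embeddings the feature vector has 4d entries, so the parameter vector must have the same length.</p>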
      <p>Here, we focus on the creation of resources adapted to the legal
domain, and on testing off-the-shelf and historical solutions. Future
work will consider specialized reasoning models.</p>
    </sec>
    <sec id="sec-13">
      <title>RELATED WORK</title>
      <p>
        There have been several efforts to translate law statutes into expert
systems. Oracle Policy Automation has been used to formalize rules
in a variety of contexts. TAXMAN [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ] focuses on corporate
reorganization law, and is able to classify a case into three different legal
types of reorganization, following a theorem-proving approach.
Sergot et al. [
        <xref ref-type="bibr" rid="ref52">52</xref>
        ] translate the major part of the British
Nationality Act 1981 into around 150 rules in micro-Prolog, proving the
suitability of Prolog logic to express and apply legislation.
Bench-Capon et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] further discuss knowledge representation issues.
Closest to our work is Sherman [
        <xref ref-type="bibr" rid="ref53">53</xref>
        ], who manually translated part
of Canada’s Income Tax Act into a Prolog program. To our
knowledge, the projects cited did not include a dataset or task that the
programs were applied to. Other works have similarly described the
formalization of law statutes into rule-based systems [
        <xref ref-type="bibr" rid="ref24 ref30 ref32 ref51">24, 30, 32, 51</xref>
        ].
      </p>
      <p>
        Yoshioka et al. [
        <xref ref-type="bibr" rid="ref62">62</xref>
        ] introduce a dataset of Japanese statute law
and its English translation, together with questions collected from
the Japanese bar exam. To tackle the associated statute retrieval and entailment tasks, Kim et al. [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]
investigate heuristic-based and machine learning-based methods.
A similar dataset based on the Chinese bar exam was released
by Zhong et al. [
        <xref ref-type="bibr" rid="ref66">66</xref>
        ]. Many papers explore case-based reasoning
for law, with expert systems [
        <xref ref-type="bibr" rid="ref43 ref56">43, 56</xref>
        ], human annotations [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or
automatic annotations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] as well as transformer-based methods
[
        <xref ref-type="bibr" rid="ref44">44</xref>
        ]. Some datasets are concerned with very specific tasks, as in
tagging in contracts [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], classifying clauses [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and classification
of documents [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] or single paragraphs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Ravichander et al. [
        <xref ref-type="bibr" rid="ref47">47</xref>
        ]
have released a dataset of questions about privacy policies, elicited
from turkers and answered by legal experts. Saeidi et al. [
        <xref ref-type="bibr" rid="ref50">50</xref>
        ] frame
the task of statutory reasoning as a dialog between a user and a
dialog agent. A single rule, with or without context, and a series
of followup questions are needed to answer the original question.
Contrary to our dataset, rules are isolated from the rest of the body
of rules, and followup questions are part of the task.
      </p>
      <p>
        Clark et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] describe a decades-long effort to answer science
exam questions stated in natural language, based on descriptive
knowledge stated in natural language. Their system relies on a
variety of NLP and specialized reasoning techniques, with their
most significant gains recently achieved via contextual language
modeling. This line of work is the most related in spirit to where we
believe research in statutory reasoning should focus. An interesting
contrast is that while scientific reasoning is based on understanding
the physical world, which in theory can be informed by all manner
of evidence beyond texts, legal reasoning is governed by
human-made rules. The latter are true by virtue of being written down and
agreed to, and are not discovered through evidence and a scientific
process. Thus, statutory reasoning is an exceptionally pure instance
of a reasoner needing to understand prescriptive language.
      </p>
      <p>
        Weston et al. [
        <xref ref-type="bibr" rid="ref60">60</xref>
        ] introduced a set of prerequisite toy tasks for
AI systems, which require some amount of reasoning and common
sense knowledge. Contrary to the present work, the types of
question in the train and test sets are highly related, and the vocabulary
overlap is quite high. Numeric reasoning appears in a variety of
MR challenges, such as in DROP [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        Understanding procedural language – knowledge needed to
perform a task – is related to the problem of understanding statutes,
and so we provide a brief description of some example
investigations in that area. Zhang et al. [
        <xref ref-type="bibr" rid="ref65">65</xref>
        ] published a dataset of how-to
instructions, with human annotations defining key attributes (actee,
purpose...) and models to automatically extract the attributes.
Similarly, Chowdhury et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] describe a dataset of human-elicited
procedural knowledge, and Wambsganß and Fromm [
        <xref ref-type="bibr" rid="ref58">58</xref>
        ]
automatically detect repair instructions from posts on an automotive forum.
Branavan et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] employed text from an instruction manual to
improve the performance of a game-playing agent.
      </p>
    </sec>
    <sec id="sec-14">
      <title>CONCLUSION</title>
      <p>We introduce a resource of law statutes, a dataset of hand-curated
rules and cases in natural language, and a symbolic solver able
to represent these rules and solve the challenge task. Our
hand-built solver contrasts with our baselines based on current NLP
approaches, even when we adapt them to the legal domain.</p>
      <p>
        The intersection between NLP and the legal domain is a growing
area of research [
        <xref ref-type="bibr" rid="ref11 ref3 ref33 ref35 ref48">3, 11, 33, 35, 48</xref>
        ], but with few large-scale
systematic resources. Thus, in addition to the exciting challenge posed by
statutory reasoning, we also intend this paper to be a contribution
to legal-domain natural language processing.
      </p>
      <p>
        Given the poor out-of-the-box performance of otherwise very
powerful models, this dataset, which is quite small compared to
typical MR resources, raises the question of what the most promising
direction of research would be. An important feature of statutory
reasoning is the relative difficulty and expense in generating
carefully constructed training data: legal texts are written for and by
lawyers, who are cost-prohibitive to employ in bulk. This is
unlike most instances of MR where everyday texts can be annotated
through crowdsourcing services. There are at least three strategies
open to the community: automatic extraction of knowledge graphs
from text with the same accuracy as we did for our Prolog solver
[
        <xref ref-type="bibr" rid="ref57">57</xref>
        ]; improvements in MR to be significantly more data-efficient in
training; or new mechanisms for the efficient creation of training
data based on pre-existing legal cases.
      </p>
      <p>Going forward, we hope our resource provides both (1) a
benchmark for a challenging aspect of natural legal language processing
as well as for machine reasoning, and (2) legal-domain NLP models
useful for the research community.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <year>2019</year>
          .
          <article-title>Caselaw Access Project</article-title>
          . http://case.law
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Sanjeev</given-names>
            <surname>Arora</surname>
          </string-name>
          , Yingyu Liang, and Tengyu Ma.
          <year>2016</year>
          .
          <article-title>A simple but tough-to-beat baseline for sentence embeddings</article-title>
          . (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Kevin D</given-names>
            <surname>Ashley</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Brüninghaus</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Automatically classifying case texts and predicting outcomes</article-title>
          .
          <source>Artificial Intelligence and Law</source>
          <volume>17</volume>
          ,
          <issue>2</issue>
          (
          <year>2009</year>
          ),
          <fpage>125</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Trevor JM</given-names>
            <surname>Bench-Capon</surname>
          </string-name>
          , Gwen O Robinson, Tom W Routen, and
          <string-name>
            <given-names>Marek J</given-names>
            <surname>Sergot</surname>
          </string-name>
          .
          <year>1987</year>
          .
          <article-title>Logic programming for large scale applications in law: A formalisation of supplementary benefit legislation</article-title>
          .
          <source>In Proceedings of the 1st international conference on Artificial intelligence and law</source>
          .
          <fpage>190</fpage>
          -
          <lpage>198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Luisa</given-names>
            <surname>Bentivogli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Clark</surname>
          </string-name>
          , Ido Dagan, and
          <string-name>
            <given-names>Danilo</given-names>
            <surname>Giampiccolo</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>The Fifth PASCAL Recognizing Textual Entailment Challenge</article-title>
          .
          <source>In TAC.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Carlo</given-names>
            <surname>Biagioli</surname>
          </string-name>
          , Enrico Francesconi, Andrea Passerini, Simonetta Montemagni, and
          <string-name>
            <given-names>Claudia</given-names>
            <surname>Soria</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Automatic semantics extraction in law documents</article-title>
          .
          <source>In Proceedings of the 10th international conference on Artificial intelligence and law</source>
          .
          <fpage>133</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>SRK</given-names>
            <surname>Branavan</surname>
          </string-name>
          , David Silver, and
          <string-name>
            <given-names>Regina</given-names>
            <surname>Barzilay</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Learning to win by reading manuals in a monte-carlo framework</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>43</volume>
          (
          <year>2012</year>
          ),
          <fpage>661</fpage>
          -
          <lpage>704</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Brüninghaus</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kevin D</given-names>
            <surname>Ashley</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Predicting outcomes of case based legal arguments</article-title>
          .
          <source>In Proceedings of the 9th international conference on Artificial intelligence and law</source>
          .
          <fpage>233</fpage>
          -
          <lpage>242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Hector Neri</given-names>
            <surname>Castañeda</surname>
          </string-name>
          .
          <year>1967</year>
          .
          <article-title>Comment on D. Davidson's “The logical forms of action sentences”</article-title>
          .
          <source>The Logic of Decision and Action</source>
          (
          <year>1967</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Ilias</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ion</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A Deep Learning Approach to Contract Element Extraction</article-title>
          .
          <source>In JURIX</source>
          .
          <fpage>155</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Ilias</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          , Ion Androutsopoulos, and
          <string-name>
            <given-names>Achilleas</given-names>
            <surname>Michos</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Obligation and prohibition extraction using hierarchical RNNs</article-title>
          .
          <source>arXiv preprint arXiv:1805.03871</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Ilias</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          , Manos Fergadiotis, Prodromos Malakasiotis, and
          <string-name>
            <given-names>Ion</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Large-Scale Multi-Label Text Classification on EU Legislation</article-title>
          .
          <source>arXiv preprint arXiv:1906.02192</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Debajyoti Paul</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          , Arghya Biswas, Tomasz Sosnowski, and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Yordanova</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Towards Evaluating Plan Generation Approaches with Instructional Texts</article-title>
          .
          <source>arXiv preprint arXiv:2001.04186</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tushar</given-names>
            <surname>Khot</surname>
          </string-name>
          , Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon,
          <string-name>
            <given-names>Sumithra</given-names>
            <surname>Bhakthavatsalam</surname>
          </string-name>
          , et al.
          <year>2019</year>
          .
          <article-title>From 'F' to 'A' on the NY Regents Science Exams: An Overview of the Aristo Project</article-title>
          .
          <source>arXiv preprint arXiv:1909.01958</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Robin</given-names>
            <surname>Cooper</surname>
          </string-name>
          , Dick Crouch, Jan Van Eijck, Chris Fox, Johan Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal,
          <string-name>
            <given-names>Massimo</given-names>
            <surname>Poesio</surname>
          </string-name>
          , et al.
          <year>1996</year>
          .
          <article-title>Using the framework</article-title>
          .
          <source>Technical Report LRE 62-051 D-16</source>
          , The FraCaS Consortium.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Ido</given-names>
            <surname>Dagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Oren</given-names>
            <surname>Glickman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bernardo</given-names>
            <surname>Magnini</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>The PASCAL recognising textual entailment challenge</article-title>
          .
          <source>In Machine Learning Challenges Workshop</source>
          . Springer,
          <fpage>177</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Donald</given-names>
            <surname>Davidson</surname>
          </string-name>
          .
          <year>1967</year>
          .
          <article-title>The logical forms of action sentences</article-title>
          .
          <source>The Logic of Decision and Action</source>
          (
          <year>1967</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Dheeru</given-names>
            <surname>Dua</surname>
          </string-name>
          , Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and
          <string-name>
            <given-names>Matt</given-names>
            <surname>Gardner</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs</article-title>
          .
          <source>arXiv preprint arXiv:1903.00161</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michele</given-names>
            <surname>Banko</surname>
          </string-name>
          , Stephen Soderland, and Daniel S Weld.
          <year>2008</year>
          .
          <article-title>Open information extraction from the web</article-title>
          .
          <source>Commun. ACM</source>
          <volume>51</volume>
          ,
          <issue>12</issue>
          (
          <year>2008</year>
          ),
          <fpage>68</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Edward A</given-names>
            <surname>Feigenbaum</surname>
          </string-name>
          .
          <year>1992</year>
          .
          <article-title>Expert systems: principles and practice</article-title>
          . (
          <year>1992</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>David</given-names>
            <surname>Ferrucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Eric</given-names>
            <surname>Brown</surname>
          </string-name>
          , Jennifer Chu-Carroll,
          <string-name>
            <given-names>James</given-names>
            <surname>Fan</surname>
          </string-name>
          , David Gondek, Aditya A Kalyanpur,
          <string-name>
            <given-names>Adam</given-names>
            <surname>Lally</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J William</given-names>
            <surname>Murdock</surname>
          </string-name>
          , Eric Nyberg, John Prager, et al.
          <year>2010</year>
          .
          <article-title>Building Watson: An overview of the DeepQA project</article-title>
          .
          <source>AI Magazine</source>
          <volume>31</volume>
          ,
          <issue>3</issue>
          (
          <year>2010</year>
          ),
          <fpage>59</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Noah S</given-names>
            <surname>Friedland</surname>
          </string-name>
          , Paul G Allen, Gavin Matthews,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Witbrock</surname>
          </string-name>
          , David Baxter,
          <string-name>
            <given-names>Jon</given-names>
            <surname>Curtis</surname>
          </string-name>
          , Blake Shepard, Pierluigi Miraglia, Jurgen Angele,
          <string-name>
            <given-names>Steffen</given-names>
            <surname>Staab</surname>
          </string-name>
          , et al.
          <year>2004</year>
          .
          <article-title>Project halo: Towards a digital aristotle</article-title>
          .
          <source>AI Magazine</source>
          <volume>25</volume>
          ,
          <issue>4</issue>
          (
          <year>2004</year>
          ),
          <fpage>29</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Wachara</given-names>
            <surname>Fungwacharakorn</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ken</given-names>
            <surname>Satoh</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Legal Debugging in Propositional Legal Representation</article-title>
          .
          <source>In JSAI International Symposium on Artificial Intelligence</source>
          . Springer,
          <fpage>146</fpage>
          -
          <lpage>159</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Bryan A</given-names>
            <surname>Garner</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Black's Law Dictionary</article-title>
          (11 ed.).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Danilo</given-names>
            <surname>Giampiccolo</surname>
          </string-name>
          , Bernardo Magnini, Ido Dagan, and
          <string-name>
            <given-names>Bill</given-names>
            <surname>Dolan</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>The third pascal recognizing textual entailment challenge</article-title>
          .
          <source>In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing</source>
          . Association for Computational Linguistics
          ,
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>David</given-names>
            <surname>Gunning</surname>
          </string-name>
          , Vinay K Chaudhri, Peter E Clark, Ken Barker, Shaw-Yi Chaw, Mark Greaves, Benjamin Grosof, Alice Leung, David D McDonald,
          <string-name>
            <given-names>Sunil</given-names>
            <surname>Mishra</surname>
          </string-name>
          , et al.
          <year>2010</year>
          .
          <article-title>Project Halo Update-Progress Toward Digital Aristotle</article-title>
          .
          <source>AI Magazine</source>
          <volume>31</volume>
          ,
          <issue>3</issue>
          (
          <year>2010</year>
          ),
          <fpage>33</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>R</given-names>
            <surname>Bar Haim</surname>
          </string-name>
          , Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and
          <string-name>
            <given-names>Idan</given-names>
            <surname>Szpektor</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>The second pascal recognising textual entailment challenge</article-title>
          .
          <source>In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.</source>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Herbert Lionel Adolphus</given-names>
            <surname>Hart</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>The concept of law</article-title>
          . Oxford University Press.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Hellawell</surname>
          </string-name>
          .
          <year>1980</year>
          .
          <article-title>A computer program for legal planning and analysis: Taxation of stock redemptions</article-title>
          .
          <source>Columbia Law Review</source>
          <volume>80</volume>
          ,
          <issue>7</issue>
          (
          <year>1980</year>
          ),
          <fpage>1363</fpage>
          -
          <lpage>1398</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Ioffe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>
          .
          <source>arXiv preprint arXiv:1502.03167</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Imran</given-names>
            <surname>Khan</surname>
          </string-name>
          , Muhammad Sher, Javed I Khan, Syed M Saqlain, Anwar Ghani, Husnain A Naqvi, and Muhammad Usman Ashraf.
          <year>2016</year>
          .
          <article-title>Conversion of legal text to a logical rules set from medical law using the medical relational model and the world rule model for a medical decision support system</article-title>
          .
          <source>In Informatics</source>
          , Vol.
          <volume>3</volume>
          . Multidisciplinary Digital Publishing Institute,
          <elocation-id>2</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Mi-Young</given-names>
            <surname>Kim</surname>
          </string-name>
          , Juliano Rabelo, and
          <string-name>
            <given-names>Randy</given-names>
            <surname>Goebel</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Statute Law Information Retrieval and Entailment</article-title>
          .
          <source>In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law</source>
          .
          <fpage>283</fpage>
          -
          <lpage>289</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Diederik P</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Anastassia</given-names>
            <surname>Kornilova</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vladimir</given-names>
            <surname>Eidelman</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BillSum: A Corpus for Automatic Summarization of US Legislation</article-title>
          .
          <source>In Proceedings of the 2nd Workshop on New Frontiers in Summarization</source>
          . Association for Computational Linguistics, Hong Kong, China,
          <fpage>48</fpage>
          -
          <lpage>56</lpage>
          . https://doi.org/10.18653/v1/D19-5406
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>John</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Fernando CN</given-names>
            <surname>Pereira</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          . (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Robert S</given-names>
            <surname>Ledley</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lee B</given-names>
            <surname>Lusted</surname>
          </string-name>
          .
          <year>1959</year>
          .
          <article-title>Reasoning foundations of medical diagnosis</article-title>
          .
          <source>Science</source>
          <volume>130</volume>
          ,
          <issue>3366</issue>
          (
          <year>1959</year>
          ),
          <fpage>9</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>L Thorne</given-names>
            <surname>McCarty</surname>
          </string-name>
          .
          <year>1976</year>
          .
          <article-title>Reflections on TAXMAN: An experiment in artificial intelligence and legal reasoning</article-title>
          .
          <source>Harv. L. Rev.</source>
          <volume>90</volume>
          (
          <year>1976</year>
          ),
          <fpage>837</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg S Corrado, and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          .
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Randolph A</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Harry E</given-names>
            <surname>Pople</surname>
            <suffix>Jr</suffix>
          </string-name>
          , and
          <string-name>
            <given-names>Jack D</given-names>
            <surname>Myers</surname>
          </string-name>
          .
          <year>1982</year>
          .
          <article-title>Internist-I, an experimental computer-based diagnostic consultant for general internal medicine</article-title>
          .
          <source>New England Journal of Medicine</source>
          <volume>307</volume>
          ,
          <issue>8</issue>
          (
          <year>1982</year>
          ),
          <fpage>468</fpage>
          -
          <lpage>476</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41] Tom Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, Bishan Yang, Justin Betteridge, Andrew Carlson, Bhavana Dalvi, Matt Gardner,
          <string-name>
            <given-names>Bryan</given-names>
            <surname>Kisiel</surname>
          </string-name>
          , et al.
          <year>2018</year>
          .
          <article-title>Never-ending learning</article-title>
          .
          <source>Commun. ACM</source>
          <volume>61</volume>
          ,
          <issue>5</issue>
          (
          <year>2018</year>
          ),
          <fpage>103</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>Terence</given-names>
            <surname>Parsons</surname>
          </string-name>
          .
          <year>1990</year>
          .
          <article-title>Events in the Semantics of English</article-title>
          . Vol.
          <volume>334</volume>
          . MIT Press, Cambridge, MA.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>Walter G</given-names>
            <surname>Popp</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Schlink</surname>
          </string-name>
          .
          <year>1974</year>
          .
          <article-title>Judith, a computer program to advise lawyers in reasoning a case</article-title>
          .
          <source>Jurimetrics J</source>
          .
          <volume>15</volume>
          (
          <year>1974</year>
          ),
          <fpage>303</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>Juliano</given-names>
            <surname>Rabelo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mi-Young</given-names>
            <surname>Kim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Randy</given-names>
            <surname>Goebel</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Combining Similarity and Transformer Methods for Case Law Entailment</article-title>
          .
          <source>In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law</source>
          .
          <fpage>290</fpage>
          -
          <lpage>296</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>Colin</given-names>
            <surname>Raffel</surname>
          </string-name>
          , Noam Shazeer, Adam Roberts,
          <string-name>
            <given-names>Katherine</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sharan</given-names>
            <surname>Narang</surname>
          </string-name>
          , Michael Matena,
          <string-name>
            <given-names>Yanqi</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wei</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Peter J</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          .
          <source>arXiv preprint arXiv:1910.10683</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>Pranav</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          , Robin Jia, and
          <string-name>
            <given-names>Percy</given-names>
            <surname>Liang</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Know What You Don't Know: Unanswerable Questions for SQuAD</article-title>
          .
          <source>In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>Abhilasha</given-names>
            <surname>Ravichander</surname>
          </string-name>
          , Alan W Black, Shomir Wilson, Thomas Norton, and
          <string-name>
            <given-names>Norman</given-names>
            <surname>Sadeh</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Question Answering for Privacy Policies: Combining Computational and Legal Perspectives</article-title>
          .
          <source>arXiv preprint arXiv:1911.00841</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>Edwina L</given-names>
            <surname>Rissland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kevin D</given-names>
            <surname>Ashley</surname>
          </string-name>
          , and Ronald Prescott Loui.
          <year>2003</year>
          .
          <article-title>AI and Law: A fruitful synergy</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>150</volume>
          ,
          <issue>1-2</issue>
          (
          <year>2003</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>Melissa</given-names>
            <surname>Roemmele</surname>
          </string-name>
          , Cosmin Adrian Bejan, and
          <string-name>
            <given-names>Andrew S</given-names>
            <surname>Gordon</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Choice of plausible alternatives: An evaluation of commonsense causal reasoning</article-title>
          .
          <source>In 2011 AAAI Spring Symposium Series.</source>
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>Marzieh</given-names>
            <surname>Saeidi</surname>
          </string-name>
          , Max Bartolo,
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sameer</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tim</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , Mike Sheldon, Guillaume Bouchard, and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Riedel</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Interpretation of natural language rules in conversational machine reading</article-title>
          .
          <source>arXiv preprint arXiv:1809.01494</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>Ken</given-names>
            <surname>Satoh</surname>
          </string-name>
          , Kento Asai, Takamune Kogawa, Masahiro Kubota, Megumi Nakamura, Yoshiaki Nishigai, Kei Shirakawa, and
          <string-name>
            <given-names>Chiaki</given-names>
            <surname>Takano</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>PROLEG: an implementation of the presupposed ultimate fact theory of Japanese civil code by PROLOG technology</article-title>
          .
          <source>In JSAI International Symposium on Artificial Intelligence</source>
          . Springer,
          <fpage>153</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>Marek J.</given-names>
            <surname>Sergot</surname>
          </string-name>
          , Fariba Sadri, Robert A. Kowalski, Frank Kriwaczek,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Hammond</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H Terese</given-names>
            <surname>Cory</surname>
          </string-name>
          .
          <year>1986</year>
          .
          <article-title>The British Nationality Act as a logic program</article-title>
          .
          <source>Commun. ACM</source>
          <volume>29</volume>
          ,
          <issue>5</issue>
          (
          <year>1986</year>
          ),
          <fpage>370</fpage>
          -
          <lpage>386</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>David M</given-names>
            <surname>Sherman</surname>
          </string-name>
          .
          <year>1987</year>
          .
          <article-title>A Prolog model of the income tax act of Canada</article-title>
          .
          <source>In Proceedings of the 1st international conference on Artificial intelligence and law</source>
          .
          <fpage>127</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>Edward H</given-names>
            <surname>Shortliffe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bruce G</given-names>
            <surname>Buchanan</surname>
          </string-name>
          .
          <year>1975</year>
          .
          <article-title>A model of inexact reasoning in medicine</article-title>
          .
          <source>Mathematical Biosciences</source>
          <volume>23</volume>
          ,
          <issue>3-4</issue>
          (
          <year>1975</year>
          ),
          <fpage>351</fpage>
          -
          <lpage>379</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>Jerrold</given-names>
            <surname>Soh</surname>
          </string-name>
          , How Khang Lim, and Ian Ernst Chai.
          <year>2019</year>
          .
          <article-title>Legal Area Classification: A Comparative Study of Text Classifiers on Singapore Supreme Court Judgments</article-title>
          .
          Association for Computational Linguistics
          , Minneapolis, Minnesota.
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>Anne vdL</given-names>
            <surname>Gardner</surname>
          </string-name>
          .
          <year>1983</year>
          .
          <article-title>The design of a legal analysis program</article-title>
          .
          <source>In AAAI-83</source>
          .
          <fpage>114</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [57]
          <string-name>Lai Dac Viet</string-name>
          , Vu Trong Sinh, Nguyen Le Minh, and
          <string-name>
            <given-names>Ken</given-names>
            <surname>Satoh</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>ConvAMR: Abstract meaning representation parsing for legal document</article-title>
          .
          <source>arXiv preprint arXiv:1711.06141</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>Thiemo</given-names>
            <surname>Wambsganß</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hansjörg</given-names>
            <surname>Fromm</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Mining User-Generated Repair Instructions from Automotive Web Communities</article-title>
          .
          <source>In Proceedings of the 52nd Hawaii International Conference on System Sciences.</source>
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Wang</surname>
          </string-name>
          , Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh,
          <string-name>
            <given-names>Julian</given-names>
            <surname>Michael</surname>
          </string-name>
          , Felix Hill,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Samuel</given-names>
            <surname>Bowman</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>SuperGLUE: A stickier benchmark for general-purpose language understanding systems</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          .
          <fpage>3261</fpage>
          -
          <lpage>3275</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          [60]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Weston</surname>
          </string-name>
          , Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer,
          <string-name>
            <given-names>Armand</given-names>
            <surname>Joulin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Towards AI-complete question answering: A set of prerequisite toy tasks</article-title>
          .
          <source>arXiv preprint arXiv:1502.05698</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          [61]
          <string-name>
            <given-names>Zhilin</given-names>
            <surname>Yang</surname>
          </string-name>
          , Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering</article-title>
          .
          <source>In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          . Association for Computational Linguistics, Brussels, Belgium,
          <fpage>2369</fpage>
          -
          <lpage>2380</lpage>
          . https://doi.org/10.18653/v1/D18-1259
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          [62]
          <string-name>
            <given-names>Masaharu</given-names>
            <surname>Yoshioka</surname>
          </string-name>
          , Yoshinobu Kano, Naoki Kiyota, and
          <string-name>
            <given-names>Ken</given-names>
            <surname>Satoh</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of Japanese statute law retrieval and entailment task at COLIEE-2018</article-title>
          .
          <source>In Twelfth International Workshop on Juris-informatics (JURISIN 2018)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref63">
        <mixed-citation>
          [63]
          <string-name>
            <given-names>Rowan</given-names>
            <surname>Zellers</surname>
          </string-name>
          , Yonatan Bisk,
          <string-name>
            <given-names>Roy</given-names>
            <surname>Schwartz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yejin</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>SWAG: A large-scale adversarial dataset for grounded commonsense inference</article-title>
          .
          <source>arXiv preprint arXiv:1808.05326</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref64">
        <mixed-citation>
          [64]
          <string-name>
            <given-names>Sheng</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Xutai Ma, Kevin Duh, and Benjamin Van Durme.
          <year>2019</year>
          .
          <article-title>Broad-Coverage Semantic Parsing as Transduction</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          .
          Association for Computational Linguistics
          , Hong Kong, China,
          <fpage>3786</fpage>
          -
          <lpage>3798</lpage>
          . https://doi.org/10.18653/v1/D19-1392
        </mixed-citation>
      </ref>
      <ref id="ref65">
        <mixed-citation>
          [65]
          <string-name>
            <given-names>Ziqi</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Philip Webster, Victoria S Uren, Andrea Varga, and
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Ciravegna</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Automatically Extracting Procedural Knowledge from Instructional Texts using Natural Language Processing</article-title>
          .
          <source>In LREC</source>
          , Vol. 2012.
          <fpage>520</fpage>
          -
          <lpage>527</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref66">
        <mixed-citation>
          [66]
          <string-name>
            <given-names>Haoxi</given-names>
            <surname>Zhong</surname>
          </string-name>
          , Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and
          <string-name>
            <given-names>Maosong</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>JEC-QA: A Legal-Domain Question Answering Dataset</article-title>
          .
          <source>arXiv preprint arXiv:1911.12011</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>