Automatic Classification of Rhetorical Roles for Sentences:
           Comparing Rule-Based Scripts with Machine Learning
                           Vern R. Walker                                                          Krishnan Pillaipakkamnatt
    Director, Research Laboratory for Law, Logic &                                           Chair, Department of Computer Science
                Technology (LLT Lab)                                                    Fred DeMatteis School of Engineering and Applied
   Maurice A. Deane School of Law, Hofstra University                                              Science, Hofstra University
              Hempstead, New York, USA                                                             Hempstead, New York, USA
               vern.r.walker@hofstra.edu                                                      krishnan.pillaipakkamnatt@hofstra.edu

    Alexandra M. Davidson                                                Marysa Linares                               Domenick J. Pesce
  Research Laboratory for Law,                                  Research Laboratory for Law,                    Research Laboratory for Law,
 Logic & Technology (LLT Lab)                                  Logic & Technology (LLT Lab)                    Logic & Technology (LLT Lab)
 Maurice A. Deane School of Law,                               Maurice A. Deane School of Law,                 Maurice A. Deane School of Law,
       Hofstra University                                            Hofstra University                              Hofstra University
  Hempstead, New York, USA                                      Hempstead, New York, USA                        Hempstead, New York, USA
       lltlab@hofstra.edu                                            lltlab@hofstra.edu                              lltlab@hofstra.edu


ABSTRACT                                                                          believed. The datasets, the protocols used to define sentence types,
                                                                                  the scripts and ML codes will be publicly available.
Automatically mining patterns of reasoning from evidence-
intensive legal decisions can make legal services more efficient,                 ACM Reference format:
and it can increase the public’s access to justice, through a range of            Vern R. Walker, Krishnan Pillaipakkamnatt, Alexandra M. Davidson,
use cases (including semantic viewers, semantic search, decision                  Marysa Linares and Domenick J. Pesce. 2019. Automatic Classification of
summarizers, argument recommenders, and reasoning monitors).                      Rhetorical Roles for Sentences: Comparing Rule-Based Scripts with
Important to these use cases is the task of automatically classifying             Machine Learning. In Proceedings of the Third Workshop on Automated
                                                                                  Semantic Analysis of Information in Legal Text (ASAIL 2019), Montreal,
those sentences that state whether the conditions of applicable legal             QC, Canada, 10 pages.
rules have been satisfied or not in a particular legal case. However,
insufficient quantities of gold-standard semantic data, and the high
cost of generating such data, threaten to undermine the                           1 Introduction
development of such automatic classifiers. This paper tests two                   Automated argument mining would add greatly to productivity in
hypotheses: whether distinctive phrasing enables the development                  legal practice, and it could increase access to justice in many legal
of automatic classifiers on the basis of a small sample of labeled                areas through a range of use cases. As a first use case, a web-based
decisions, with adequate results for some important use cases, and                semantic viewer might automatically highlight for the user those
whether semantic attribution theory provides a general                            sentences and patterns that are of interest in argumentation. For
methodology for developing such classifiers. The paper reports                    example, a semantic viewer might open a new decision document
promising results from using a qualitative methodology to analyze                 and provide filters that the user could select to highlight only the
a small sample of classified sentences (N = 530) to develop rule-                 conclusions, or the evidence, or the stated reasoning from evidence
based scripts that can classify sentences that state findings of fact             to conclusions.
(“Finding Sentences”). We compare those results with the                              Second, we could create semantic search tools, which would use
performance of standard machine learning (ML) algorithms trained                  components of reasoning to search through hundreds of thousands
and tested on a larger dataset (about 5,800 labeled sentences),                   of decisions in plain-text format, retrieve those decisions similar to
which is still relatively small by ML standards. This methodology                 a new case (e.g., those with similar issues to be proved, and similar
and these test results suggest that some access-to-justice use cases              types of evidence available for proving them), rank the similar
can be adequately addressed at much lower cost than previously                    decisions in order of greatest similarity, and then display the
                                                                                  portions to read (using a semantic viewer).
__________________________________________________________________
                                                                                      Third, the capability of extracting from published decisions the
In: Proceedings of the Third Workshop on Automated Semantic Analysis of           major conclusions, the intermediate reasoning, and the evidentiary
Information in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada.      basis for a decision would also provide the components of an
© 2019 Copyright held by the owner/author(s). Copying permitted for private and
academic purposes.                                                                informative summary of that decision. A decision summarizer
Published at http://ceur-ws.org
 ASAIL 2019, June 2019, Montreal, QC, Canada                                                                            V.R. Walker et al.

could analyze a decision and create a digest or summary for the          classify argument types is therefore a problem that any effort to
human reader.                                                            create gold-standard data must address.
    Fourth, an automatic argument miner could extract successful             This paper reports on preliminary research that addresses all of
and unsuccessful patterns of reasoning from thousands of past            these problems. Our main hypothesis is that reports of fact-finding
decisions, and then it could generate suggestions for new arguments      in legal adjudicatory decisions might employ such regular and
in new cases. Such an argument recommender could assist                  distinctive phrasing that even rule-based scripts based on a very
attorneys, non-attorneys, and judges in processing new cases.            small sample, as well as ML models trained on larger samples, can
    Fifth, such an automated argument miner could monitor cases          perform adequately for many valuable use cases. Our second
as they are being litigated, compare evolving arguments with             hypothesis is that attribution theory from linguistics can be
patterns of reasoning that have been successful or unsuccessful in       extended to argument mining, as a method for creating semantic
the past, and detect possible outliers (arguments likely to fail or      types and automatically identifying them in legal decisions. There
decisions likely to be incorrect). A reasoning monitor could also        is reason to think that such an approach will be transferable to
maintain statistics and trends for patterns of reasoning, and it could   adjudicatory decisions in many substantive areas of law.
predict probabilities of success for new cases.                              Using an annotated dataset of U.S. decisions as the gold
    Provided the data derived from argument mining is valid and          standard, we investigated a methodology for qualitatively studying
predictive of real-case outcomes, such tools could assist alternative    a very small sub-sample of such decisions, developing rule-based
dispute resolution and increase efficiency within the legal system.      scripts, and quantitatively testing the script performance. We
Such automated, evidence-based tools (semantic viewers, semantic         compared those outcomes against the performance of standard
search, decision summarizers, argument recommenders, and                 supervised ML models trained on larger samples from the same
reasoning monitors) could also assist non-lawyers when they              dataset. Our study addresses data quantity by employing very small
represent themselves in cases where a lawyer is not available.           datasets; it addresses data validity by employing quality-assurance
    Producing such a range of tools, however, faces several              protocols and publishing those protocols and the resulting data; and
challenges. One challenge is whether machine learning (ML) tools         it addresses annotation type systems by explaining their derivation.
can be effective at automating such argument mining. First, there is     The paper reports promising results relative to important use cases,
the problem of the available quantity of gold-standard data for          and it lays out a methodology that should be transferable to many
training and testing. Supervised ML may require such a large             areas of law at a relatively small cost – thus helping to improve
quantity of accurately labeled documents that there are not              access to justice. We make publicly available the annotated dataset,
sufficient resources to generate it, in all areas of law where           the quality-assurance protocols, the scripts and the ML settings, at
argument mining is desirable. While semi-supervised ML, using            https://github.com/LLTLab/VetClaims-JSON.
large quantities of unlabeled data and small quantities of labeled           After we discuss prior related work in Section 2, we describe
data, offers more promise, even that approach requires trained           our dataset and how we used it in both our script and ML
human classifiers. Especially in legal areas where outcomes may          experiments (Section 3). Section 4 describes the qualitative-study
not economically support the hiring of a lawyer (e.g., veterans’         experiments and their results, while Section 5 describes the ML
disability claims or immigration asylum claims), there may be little     experiments and their results. Section 6 contains our general
financial incentive to create such a great quantity of ground-truth      discussion of these combined results and our future work.
annotated data. Moreover, some areas of law (such as vaccine-
injury compensation decisions) may not even produce a sufficient         2 Prior Related Work
quantity of cases bearing on a particular issue, even if we were to
                                                                         The context for the work reported in this paper is the goal of
annotate it all.
                                                                         automated argument mining from adjudicatory legal decisions.
    Moreover, there is the challenge of ensuring data validity. In
                                                                         Such argument mining would automatically extract the evidence
order to create effective tools, especially tools that predict future
                                                                         assessment and fact-finding reasoning found in adjudicatory
outcomes given past decisions, the data upon which the tools are
                                                                         decisions, for the purpose of identifying successful and
trained must accurately reflect what we believe it measures. But
                                                                         unsuccessful units of evidentiary argument. Researchers generally
appropriately annotating the components of legal reasoning
                                                                         identify an argument as containing a conclusion or claim, together
requires an adequate theory of legal reasoning, a sufficient number
                                                                         with a set of one or more premises. [21, 40, 30, 15, 32]
of trained annotators, and adequate quality assurance. Also, to
                                                                            A first level of analysis is to classify the rhetorical roles of
inspire trust in ML outputs, the models must be transparent and
                                                                         sentences for argument mining – that is, assigning sentences roles
understandable.
                                                                         as either premise or conclusion. Prior work on classifying such
    Finally, there is the challenge of developing and testing
                                                                         rhetorical roles in adjudicatory decisions includes: applying
adequate classification type systems for legal arguments [34, 38].
                                                                         machine learning to annotate sentence types in vaccine-injury
Unsupervised ML has the challenge of producing useful clusters,
                                                                         compensation decisions [3, 4, 5, 10]; assigning rhetorical roles to
especially if the components of legal argument are poorly
                                                                         sentences in Indian court decisions [27]; classifying sentences as
understood, the components vary depending on the use case, and
                                                                         argumentative in the Araucaria corpus, including newspapers and
the components are different for different practitioners. How to
                                                                         court reports [19]; automatically summarizing legal judgments of
                                                                         the U.K.’s House of Lords, in part by classifying the rhetorical
 Automatic Classification of Rhetorical Roles for Sentences                                 ASAIL 2019, June 2019, Montreal, QC, Canada

status of sentences [13]; annotating sentences as containing legal         example, the main verb of a finding sentence tends to be in present
principles or facts in common-law reports [29]; and using statistical      tense, while the main verbs of evidence sentences tend to be in past
parsing as the input for computing quasi-logical forms as deep             tense. Features derived from the protocols can drive the application
semantic interpretations of sentences in U.S. appellate court              of high-precision / low-recall techniques of the kind used
decisions [17]. Al-Abdulkarim et al. provide one overview of               successfully by [15], which we argue is the performance desired for
statement types involved in legal reasoning in cases, from evidence        certain use cases but not others.
to a verdict [1]. The approach we describe in this paper utilizes a
type system of rhetorical roles developed to annotate any fact-            3 The Datasets
finding decision, and we compare script and ML classifiers for the
                                                                           We developed a common dataset to use in comparing the
rhetorical roles of sentences. Moreover, it is not common for
                                                                           classification performance of rule-based script classifiers with the
datasets to be publicly available, together with protocols for data
                                                                           performance of ML models. This section describes that dataset and
generation, scripts and codes, to enable confirmation of data
                                                                           how it was used.
accuracy and replication of results.
    A recent article compared two experiments in automated                 3.1 The BVA PTSD Dataset
classification of legal norms from German statutes, with regard to
their semantic type: (1) a rule-based approach using hand-crafted          We analyzed 50 fact-finding decisions issued by the U.S. Board of
pattern definitions, and (2) an ML approach. [39] (For similar work        Veterans’ Appeals (“BVA”) from 2013 through 2017 (the “PTSD
on Dutch laws, see [16].) The performance metrics for the two              dataset”). We arbitrarily selected those decisions from adjudicated
experiments were comparable on a dataset of manually-labeled               disability claims by veterans for service-related post-traumatic
statements. While this study is highly relevant to our work, there         stress disorder (PTSD). PTSD is a mental health problem that some
are distinct differences. We develop a qualitative methodology for         people develop after experiencing or witnessing a traumatic event,
developing classification features of sentences in adjudicatory            such as combat or sexual assault. Individual claims for
decisions (not statutes), according to their rhetorical role (not norm     compensation for a disability usually originate at a Regional Office
type), for the purpose of automated argument mining. Our                   (“RO”) of the U.S. Department of Veterans Affairs (“VA”), or at
methodology is general, and it should be transferable to                   another local office across the country [2, 20]. If the claimant is
adjudicatory decisions in any substantive area of law.                     dissatisfied with the decision of the RO, she may file an appeal to
    To identify the rhetorical roles of sentences, we employ an            the BVA, which is an administrative appellate body that has the
extension of the semantic theory of attribution analysis. Attribution,     authority to decide the facts of each case based on the evidence
in the context of argument mining, is the descriptive task of              [20]. The BVA must provide a written statement of the reasons or
determining which actor is asserting, assuming or relying upon             bases for its findings and conclusions, and that statement “must
which propositions, in the course of presenting reasoning or               account for the evidence which [the BVA] finds to be persuasive or
argument. Although attribution is a classic problem area in natural        unpersuasive, analyze the credibility and probative value of all
language processing generally [7, 14, 22, 23], there has been              material evidence submitted by and on behalf of a claimant, and
limited work on attribution in respect to argument mining from             provide the reasons for its rejection of any such evidence.” Caluza
legal documents. Grover et al. reported on a project to annotate           v. Brown, 7 Vet. App. 498, 506 (1995), aff’d, 78 F.3d 604 (Fed. Cir.
sentences in House of Lords judgments for their argumentative              1996).
roles [11]. Two tasks were to attribute statements to the Law Lord             The veteran may appeal the BVA’s decision to the U.S. Court
speaking about the case or to someone else (attribution), and to           of Appeals for Veterans Claims (the “Veterans Court”) [20], but the
classify sentences as formulating the law objectively vs. assessing        standard of review for issues of fact is very deferential to the BVA.
the law as favoring a conclusion or not favoring it (comparison).          In order to set aside a finding of fact by the BVA, the Veterans
This work extended the work of [31] on attribution in scientific           Court must determine it to be “clearly erroneous.” [20] And
articles. A broader discussion of attribution within the context of        although either the claimant or the VA may appeal a Veterans Court
legal decisions is found in [34]. Unlike the adjudicatory decisions        decision to the U.S. Court of Appeals for the Federal Circuit, the
used in our study, the House of Lords judgments studied by [11]            Federal Circuit may only review questions of law, such as a
treated facts as already settled in the lower courts. Our study            constitutional challenge, or the interpretation of a statute or
appears to be unique in using attribution analysis to help classify        regulation relied upon by the Veterans Court. [2, 20] Except for
the rhetorical roles of sentences in the evidence assessment portions      constitutional issues, it “may not review any ‘challenge to a factual
of adjudicatory texts.                                                     determination’ or any ‘challenge to a law or regulation as applied
    We have also developed classification protocols (classification        to the facts of a particular case.’” Kalin v. Nicholson, 172 Fed.Appx.
criteria and methods) for each rhetorical role. We use protocols to        1000, 1002 (Fed.Cir. 2006). Thus, the findings of fact made by the
give precise meaning to the semantic type, to train new annotators,        BVA are critical to the success or failure of a veteran’s claim.
and to review the accuracy of human annotations. We also use such              The BVA’s workload has increased dramatically in the past
protocols to guide the development of the features or rule-based           decade, reaching 85,288 decisions in fiscal year 2018. [6, p. 32]
scripts for automatically classifying legal texts (e.g., [28]). Stab and   The vast majority of appeals (96%) considered by the BVA involve
Gurevych have classified such features into 5 groups [30]. For             claims for compensation. [6, p. 31] Therefore, identifying the
    ASAIL 2019, June 2019, Montreal, QC, Canada                                                                                   V.R. Walker et al.

patterns of factual reasoning within the decisions of the BVA                        Reasoning Sentence. A Reasoning Sentence primarily reports
presents a significant challenge for automated argument mining.                  the trier of fact’s reasoning underlying the findings of fact
    For each of the 50 BVA decisions in our PTSD dataset, we                     (therefore, a premise). Such reasoning often involves an assessment
extracted all sentences addressing the factual issues related to the             of the credibility and probative value of the evidence. An example
claim for PTSD, or for a closely-related psychiatric disorder. This              of a Reasoning Sentence is: “Also, the clinician’s etiological
set of sentences (“PTSD-Sent”) is the dataset on which we                        opinions are credible based on their internal consistency and her
conducted our experiments. The “Reasons and Bases” section of                    duty to provide truthful opinions.” (BVA1340434)
the decision is the longest section, containing the Board’s statement                A unit of argument or reasoning within evidence assessment is
of the evidence, its evaluation of that evidence, and its findings of            usually composed of these three types of sentence (finding,
fact on the relevant legal issues.                                               evidence, and reasoning). The “Reasons and Bases” section of a
                                                                                 BVA decision generally also includes two other types of sentence
3.1.1 Rhetorical Roles of Sentences in the PTSD-                                 (those stating legal rules and citations), which must be
     Sent Dataset                                                                distinguished from the first three. Unlike the case-specific elements
For the purpose of identifying reasoning or argument patterns, we                of evidence, reasoning and findings, the legal rules and citations are
focus primarily on sentences that play one of three rhetorical roles             often the same for tens of thousands of cases, even though the
in evidence assessment: the finding of fact, which states whether a              sentences stating those rules and citations can be highly variable
propositional condition of a legal rule is determined to be true, false          linguistically, depending upon the writing style of the judge.
or undecided; the evidence in the legal record on which the findings                 Legal-Rule Sentence. A Legal-Rule Sentence primarily states
rest, such as the testimony of a lay witness or a medical record; and            one or more legal rules in the abstract, without stating whether the
the reasoning from the evidence to the findings of fact. Identifying             conditions of the rule(s) are satisfied in the case being decided. An
the sentences that have those roles within adjudicatory decisions,               example of a Legal-Rule Sentence is: “Establishing direct service
however, presents special problems. Such decisions have a wide                   connection generally requires medical or, in certain
diversity of roles for sentences: e.g., stating the legal rules, policies        circumstances, lay evidence of (1) a current disability; (2) an in-
and principles applicable to the decision, as well as providing                  service incurrence or aggravation of a disease or injury; and (3) a
citations to authority; stating the procedural history of the case, and          nexus between the claimed in-service disease or injury and the
the rulings on procedural issues; summarizing the evidence                       present disability.” (BVA1340434)
presented and the arguments of the parties based on that evidence;                   Citation Sentence. A Citation Sentence references legal
and stating and explaining the tribunal’s findings of fact based on              authorities or other materials, and usually contains standard
that evidence. [37] Thus, BVA decisions pose the challenge of                    notation that encodes useful information about the cited source. An
classifying rhetorically important types of sentence and                         example is: “See Dalton v. Nicholson, 21 Vet. App. 23, 38 (2007);
distinguishing them from other types of sentence.                                Caluza v. Brown, 7 Vet. App. 498, 511 (1995), aff'd per curiam, 78
    The following are the 5 rhetorical roles that we used to classify            F.3d 604 (Fed. Cir. 1996).” (BVA1340434)
sentences in the PTSD-Sent dataset. Sentences were classified                        The frequencies of sentence rhetorical types within the PTSD-
manually by teams of 2 trained law students, and they were curated               Sent dataset are shown in Table 1.
by a law professor with expertise in legal reasoning. Data validity                   Rhetorical Type                    Frequency
is open to scrutiny because the data will be publicly available.                      Finding Sentence                   490
    Finding Sentence. A Finding Sentence is a sentence that                           Evidence Sentence                  2,419
primarily states a “finding of fact” – an authoritative conclusion of                 Reasoning Sentence                 710
the trier of fact about whether a condition of a legal rule has been                  Legal-Rule Sentence                938
satisfied or not, given the evidence in the case. An example of a
                                                                                      Citation Sentence                  1,118
Finding Sentence is: “The most probative evidence fails to link the
                                                                                      Other Sentences                    478
Veteran's claimed acquired psychiatric disorder, including PTSD,
                                                                                      Total                              6,153
to active service or to his service-connected residuals of frostbite.”
(BVA1340434)1                                                                      Table 1. Frequency of Sentences in PTSD-Sent Dataset, by
    Evidence Sentence. An Evidence Sentence primarily states the                                       Rhetorical Type
content of the testimony of a witness, states the content of
documents introduced into evidence, or describes other evidence.                     For each rhetorical role, a protocol provides a detailed definition
Evidence sentences provide part of the premises for findings of                  of the role, as well as methods and criteria for manually classifying
fact. An example of an Evidence Sentence is: “The examiner who                   sentences, and illustrative examples. Such protocols furnish
conducted the February 2008 VA mental disorders examination                      materials not only for training annotators and for conducting
opined that the Veteran clearly had a preexisting psychiatric                    quality assurance of data validity, but also for developing rule-
disability when he entered service.” (BVA1303141)                                based scripts that help automate the classification process. In this


1
 We cite decisions by their BVA citation number, e.g., “BVA1302544”. Decisions
are available from the VA website: https://www.index.va.gov/search/va/bva.jsp.
 Automatic Classification of Rhetorical Roles for Sentences                              ASAIL 2019, June 2019, Montreal, QC, Canada

paper, we use initial caps in referring to a specific semantic type
                                                                         The veteran has a disability that is “service-connected”.
that is defined by a protocol (e.g., “Finding Sentence”), in contrast
                                                                             AND [1 of 3] The veteran has “a present disability”.
to a reference to a corresponding general concept (e.g., a finding
                                                                                  OR [1 of …] The veteran has “a present disability”
sentence). The protocols for these five rhetorical roles will be made             of posttraumatic stress disorder (PTSD), supported
publicly available, along with the PTSD-Sent dataset.                             by “medical evidence diagnosing the condition in
                                                                                  accordance with [38 C.F.R.] § 4.125(a)”.
3.1.2     “Finding Sentences” as Critical Connectors                              OR [2 of …] …
“Finding Sentences” (as defined in Section 3.1.1 above) are critical         AND [2 of 3] The veteran incurred “a particular injury
connectors in argument mining. They connect the relevant evidence            or disease … coincident with service in the Armed
                                                                             Forces, or if preexisting such service, [it] was aggravated
and related reasoning (which function as premises) to the
                                                                             therein”.
appropriate legal issue, and they state whether a proponent’s proof
                                                                                  OR [1 of …] The veteran’s disability claim is for
has been successful or not (the conclusion of the reasoning). Our                 service connection of posttraumatic stress disorder
experiments test the automatic classification of Finding Sentences,               (PTSD), and there is “credible supporting evidence
as distinct from the other sentence roles.                                        that the claimed in-service stressor occurred”.
    The governing substantive legal rules state the factual issues to             OR [2 of …] …
be proved – that is, the conditions under which the BVA is required          AND [3 of 3] There is “a causal relationship [“nexus”]
to order compensation, or the BVA is prohibited from ordering                between the present disability and the disease or injury
compensation. A legal rule can be represented as a set of                    incurred or aggravated during service”.
propositions, one of which is the conclusion and the remaining                    OR [1 of …] The veteran’s disability claim is for
propositions being the rule conditions [35, 18]. Each condition can               service connection of posttraumatic stress disorder
                                                                                  (PTSD), and there is “a link, established by medical
in turn function as a conclusion, with its own conditions nested
                                                                                  evidence, between current symptoms and an in-
within it [37]. The resulting set of nested conditions has a tree
                                                                                  service stressor”.
structure – with the entire representation of the applicable legal                OR [2 of …] …
rules being called a “rule tree” [35]. A rule tree integrates all the
governing rules from statutes, regulations, and case law into a
                                                                            Figure 1. High-Level Rule Tree for Proving a Service-
single, computable system of legal rules.                                       Connected Disability, and Specifically PTSD.
    Figure 1 shows the highest levels of the rule tree for proving
that a veteran’s PTSD is “service-connected”, and therefore eligible        Theorists on sample size for qualitative studies have determined
for compensation. As shown in Figure 1, there are three main rule       that the appropriate size depends upon many factors [24]. They
conditions that a veteran must prove (connected to the ultimate         recommend that researchers can stop adding to the observation
conclusion at the top by the logical connective “AND”), and within      sample once that sample has reached reasonable “saturation,” such
each branch there are specific conditions if the claim is for PTSD      that it is sufficiently information-rich and adding more members
(connected to the branch by “OR”, indicating that alternative           would be redundant. [24, 12] In the present study, rather than
disabilities may have their own particular rules). In a BVA decision    devising a metric for saturation, we decided to test our main
on such a disability claim, therefore, we expect the fact-finding       hypothesis by randomly drawing a very small sample of 5
reasoning to be organized around arguments and reasoning on these       decisions, analyzing the 58 sentences labeled as Finding Sentences
three PTSD rule conditions. Therefore, the rule tree governing a        in those decisions, forming hypotheses about predictive
legal adjudication (such as a BVA case) provides the issues to be       classification features, and testing the predictive power of those
proved, and an organization structure for classifying arguments or      features.
reasoning based on the evidence. The critical connectors between            The qualitative-study test sample (“QS-TS”) consists of the
the rule conditions of the rule tree and the evidence in a specific     remaining 45 BVA decisions from the PTSD dataset, excluding the
case are the Finding Sentences.                                         5 decisions we used to create the QS-OS dataset. As we formulated
                                                                        hypotheses about the classifying power of linguistic features based
3.2 The Qualitative Study Datasets                                      on the QS-OS, we tested those features quantitatively against the
From the common dataset of 50 BVA decisions we randomly drew            QS-TS. Within these 45 decisions, we isolated only the evidence
a set of 5 decisions to function as the qualitative-study               assessment portions of the decisions, the extended section under the
observation sample (“QS-OS”). The QS-OS is the sample of                heading “Reasons and Bases” for the findings. We call this set of
labeled sentences that we studied qualitatively to hypothesize          labeled sentences the “QS-TS-R&B”. This dataset contains 5,422
classification features for rhetorical roles. The QS-OS dataset         sentences, with the following frequencies for particular sentence
contains 530 sentences, with the following frequencies for              roles: Finding Sentences = 358, Evidence Sentences = 2,218,
particular sentence roles: Finding Sentences = 58, Evidence             Reasoning Sentences = 669, Legal-Rule Sentences = 857, Citation
Sentences = 201, Reasoning Sentences = 40, Legal-Rule Sentences         Sentences = 1,015, and other Sentences = 305. We used QS-TS-
= 81, Citation Sentences = 103, and other Sentences = 47.               R&B to test our observation-based hypotheses about predictive
                                                                        linguistic features.
 ASAIL 2019, June 2019, Montreal, QC, Canada                                                                              V.R. Walker et al.

3.3 The Machine Learning Dataset                                         Board); and (C) the attribution object, or the propositional content
                                                                         that we attribute to the attribution subject, expressed in normal form
For our ML experiments, we started with the entire PTSD-Sent
                                                                         by an embedded clause (in the example, the veteran currently has
dataset and performed certain preprocessing. We removed
                                                                         PTSD). We distinguish the attribution cues and attribution subjects,
sentences that are merely headings, as well as numeric strings in
                                                                         on the one hand, from the proposition being attributed. We call the
the data. All words that remained were stemmed using NLTKs
                                                                         former “finding-attribution cues” because a lawyer uses them to
Snowball stemmer. Since punctuation symbols such as hyphens
                                                                         determine whether a sentence states a finding of fact or not,
appear to interfere with the stemmer, we filtered out all non-
                                                                         regardless of which legal-rule condition might be at issue. The
alphabetic characters prior to the stemming step. If the filtering and
                                                                         proposition being attributed, on the other hand, is the content of the
stemming processes reduced a sentence to only blank characters,
                                                                         finding. In the example above, the finding-attribution cues are “The
the entire sentence was dropped. Importantly, English stop words
                                                                         Board finds that”, while the attribution object is the proposition
were not eliminated. Considering that each instance is a relatively
                                                                         “the veteran currently has PTSD.” An important reason for
short English sentence, eliminating any words might increase the
                                                                         separating these two categories and testing their performance
classification error rate.
                                                                         independently is that finding-attribution cues are more likely to be
    This preprocessing stage reduced the total data set to 5,797
                                                                         transferable to disabilities other than PTSD, and they are more
usable labeled sentences. The frequencies of sentence types after
                                                                         likely to have counterparts even in other areas of law.
preprocessing were: Finding Sentences = 490, Evidence Sentences
= 2,419, Reasoning Sentences = 710, Legal-Rule Sentences = 938,
                                                                         4.2 Experiments with Finding-Attribution Cues
Citation Sentences = 899, and other Sentences = 341.
    The features chosen for the machine learning algorithm were          We conducted a qualitative study of the finding-attribution cues
the individual tokens in all the sentences (3,476), and the bigrams      that occur within QS-OS, and ran various experiments to determine
(30,959) and trigrams (59,373) that appear in them. These features       how scripts built on those cues would perform against QS-TS-
also form the vocabulary for the vectorizer. We used the                 R&B. This section reports the results of several of those
CountVectorizer class of the Scikit-learn Machine Learning library       experiments, with the results tabulated in Table 2.
[25] as the feature extractor. The size of the vector was equal to the
vocabulary size (93,808). On average, each sentence had about 60         4.2.1     Experiments E1 and E1N
true entries.                                                            It appeared from the QS-OS that a highly-predictive single word
                                                                         might be “finds”. Although in this experiment we did not perform
4    Results of the Qualitative Study                                    part-of-speech tagging, the word “finds” is generally used as a main
This Section describes the experiments we conducted in the               verb (present tense, singular) when the Board states a finding. This
qualitative study, as well as the results of those experiments. As we    is contrasted with Evidence Sentences, in which the verb is
discussed in Section 3.2, the qualitative study was designed to test     generally in the past tense (e.g., “found”), and the sentence
our main hypothesis that we can use a very small observational           attributes a proposition to a witness or document in the evidentiary
sample (only 5 decisions, containing 530 labeled sentences) to           record. We also observed occurrences of “concludes” and “grants”
develop classifying scripts that perform reasonably well against the     used in the same way as “finds”. We ran these three alternatives as
remainder of the PTSD dataset (a test dataset of 45 decisions,           a single experiment, using the regular expression (finds | concludes
containing 5,422 labeled sentences), at least for purposes of some       | grants), with the results shown as E1 in Table 2.
use cases. We also use the qualitative study to test our second              As shown in Table 2, a common mis-classification in E1 was
hypothesis that attribution theory provides a general and                with Legal-Rule Sentences. In Section 4.3 below, we discuss why
transferable method for creating semantic types and linguistic           precision is important for our use cases. By examining the Legal-
features.                                                                Rule Sentences in QS-OS, we noted that, consistent with our main
                                                                         hypothesis, certain types of words and phrases occur in those
4.1 The Qualitative Study Methodology                                    sentences that we use to attribute them to legal authorities as
                                                                         sources of general legal rules. Such words and phrases include
In order to develop a systematic methodology for discovering
                                                                         indefinite noun phrases (such as “a veteran,” as contrasted with “the
linguistic features that might classify Finding Sentences, we used
                                                                         Veteran”), conditional terms (such as “if” and “when”), and words
attribution theory. An example of a sentence explicitly stating an
                                                                         typically used as cues for attributing propositions to higher courts
attribution relation is: The Board finds that the veteran currently
                                                                         (such as “held that” or “ruled that”). We tested scripts that used
has PTSD. In interpreting the meaning of this sentence, we attribute
                                                                         such words or phrases to exclude Legal-Rule Sentences from the
to “the Board” the conclusion that “the veteran currently has
                                                                         results of E1, with the results shown in Table 2 for E1N.
PTSD”. As illustrated in this example, attribution relations have at
least three elements or predicate arguments [22, 41]: (A) the
attribution cue that signals an attribution, and which provides the
                                                                         4.2.2     Experiments E2 and E2N
lexical grounds for making the attribution (in the example, finds        A primary strength of a qualitative study is being able to identify a
that); (B) the attribution subject, or the actor to which we attribute   phrase that might be highly predictive of Finding Sentences due to
the propositional content of the sentence (in the example, the           the legal meaning of the phrase. One such phrase is “preponderance
                                                                         of the evidence”, which is used to formulate the legal standard for
 Automatic Classification of Rhetorical Roles for Sentences                                ASAIL 2019, June 2019, Montreal, QC, Canada

finding a proposition to be a fact. An alternative phrase that is often       In addition, precision = 0.647 (for E1N+2N, Table 2) might be
used when assessing what the total evidence proves is “weight of          acceptable, because the false positives (sentences incorrectly
the evidence”. We ran scripts using these two alternatives against        classified as Finding Sentences) constituted only about 1/3 of the
QS-TS-R&B, with the results shown in Table 2 as E2.                       predicted sentences. Moreover, the largest number of mis-classified
                                                                          sentences occurred in Reasoning Sentences (68). This may be
                 E1      E1N      E2     E2N E1+2        E1N+2N
                                                                          because a judge might use a main verb such as “finds” when
  Finding        129     129      46      43  159          156
                                                                          reporting the Board’s intermediate reasoning about the credibility
  Evidence         3       3       0      0     3            3
                                                                          or persuasiveness of individual items of evidence. Of the
 Reasoning        67      66       2      2    69           68
                                                                          incorrectly classified sentences, about 80% were Reasoning
 Legal-Rule       14      10      18      2    30           12
  Citation         0       0       0      0     0            0            Sentences, which are probably also instructive to a user who is
   Other           1       1       1      1     2            2            looking for examples of arguments about evidence. For such a use
                                                                          case, a user might learn as much or more from reviewing a
   Recall       0.360 0.360 0.128 0.120 0.444              0.436          Reasoning Sentence as from reviewing a Finding Sentence, and
  Precision     0.603 0.617 0.687 0.896 0.605              0.647          confusion between these two rhetorical roles is less important. For
     F1         0.450 0.455 0.216 0.212 0.512              0.521          these use cases (semantic search and semantic viewer), the
  Table 2. Qualitative Study Test Results (Frequencies) for               performance of even these simple scripts could be very useful.
   Finding-Attribution Cues, by Sentence Rhetorical Role                      Contrast such use cases with a use case that calculates a
                                                                          probability of success for an argument pattern, based on historic
   As with experiment E1 above, the mis-classified Legal-Rule             results in decided cases. For such a use case, the validity of the
Sentences had the undesirable effect of lowering the precision of         probability would depend critically upon relative frequency in the
the script. By examining the Legal-Rule Sentences in QS-OS, we            database, and on high recall and precision of similar arguments
hypothesized that modal words or phrases, in addition to those            from past cases. Retrieving every similar case would be a priority
indefinite, conditional and attributional words and phrases               with a potentially significant cost of error – e.g., reliance on an
discussed in Section 4.2.1, could be used to exclude Legal-Rule           erroneous probability in deciding whether to bring or settle a new
Sentences. Examples of such modal phrases are “must determine”            legal case. Moreover, confusion between Finding Sentences (which
and “are not necessary.” Scripts including these four types of words      record whether an argument was successful or not) and any other
produced the results shown in Table 2 for E2N.                            rhetorical type of sentence could have significant consequences.
                                                                              Because we based the script development for these experiments
4.2.3     Experiments E1+2 and E1N+2N                                     on attribution theory, as well as on general concepts used to
In order to test a combination of scripts, we ran a script that           increase precision, we expect this methodology to be transferable
classified a sentence as a Finding Sentence if either E1 so classified    to other legal areas besides veterans’ disability claims.
it or E2 did so. The results are shown as E1+2 in Table 2. We also
ran a combined experiment, including the Legal-Rule Sentence              5    Results of the Machine Learning Study
exclusion scripts from E1 (E1N) and from E2 (E2N), with the               This Section describes the experiments we conducted in the ML
results shown as E1N+2N in Table 2.                                       study, as well as the results of those experiments. As described in
                                                                          Section 3.3, we filtered out certain sentences from the dataset, and
4.3 Discussion of the Qualitative Study                                   stemmed the words, leaving us with a preprocessed dataset of 5,797
We emphasize that we had a very limited objective in these                labeled sentences. Our goal was two-fold: to assess how well the
experiments: to test, in a preliminary way, whether we could use          chosen machine learning classifiers perform relative to each other
attribution theory to develop hand-crafted, rule-based scripts that       and to the qualitative-study scripts; and to find out which features
could perform adequately in a variety of important use cases. If we       were determined by each classifier as being significant to the
could observe useful linguistic patterns in only 5 decisions, we          prediction of Finding Sentences. The algorithms we chose for this
might be able to develop a general methodology that would be              study are Naive Bayes (NB), Logistic Regression (LR), and support
transferable to adjudicatory decisions in many areas of law.              vector machines (SVM) with a linear kernel [26, 9, 8].
    We also stress that whether performance is adequate is a                  We ran each ML algorithm 10 times, each run using a randomly
function of the end use case. For example, if the use case is to          chosen training subset that contained 90% of the labeled sentences.
retrieve similar cases and to highlight sentences by rhetorical type      The trained classifier was used to predict the labels for the
for the purpose of suggesting how similar evidence has been argued        remaining 10% of sentences. All results shown in this section are
in past cases, then the priority might be on precision over recall.       the averages over these 10 runs.
This is because wasting the user’s time with non-responsive returns           For each ML algorithm, we ran two sets of experiments. In the
might have a more serious cost than merely failing to retrieve all        first set of experiments (the “multi-class” experiments) we retained
similar cases. For such a use case, even recall = 0.436 (for E1N+2N,      the labels for all 5 sentence types in the PTSD-Sent dataset – i.e.,
Table 2) might be useful because nearly half of all Finding               each classifier was fit to a multi-class training set. We recorded the
Sentences were correctly identified (true positives).                     overall accuracy score (the fraction of correctly labeled test
                                                                          instances), the classification summary, and the confusion matrix for
 ASAIL 2019, June 2019, Montreal, QC, Canada                                                                                   V.R. Walker et al.

each algorithm and each run. The classification summary records            indicates that the classifier is likely to generate a number of false
the precision, recall and F1-score for each label. A confusion matrix      positives. The underlying issue is likely to be the strong assumption
cell-value C[i][j] is the number of test sentences that are known to       of conditional independence between the features. Finally, the
be in class i (row i) but are predicted by the classifier to be in class   inability of this model to indicate which features were most
j (column j). All values shown are averages over the 10 runs.              important in making the determination of Finding Sentences makes
    In the second set of experiments (the “two-class” experiment),         it an opaque classifier.
we labeled all sentences other than Finding Sentences as “Non-
Finding” sentences, so the training and test datasets then contained       5.2 Logistic Regression (LR)
only two classes. As before, we recorded the accuracy scores, the          The LR algorithm produces a binary classifier, also known as a log-
classification summaries, and the confusion matrices and averaged          linear classifier. Since the LR algorithm produces only binary
them over the runs of each algorithm. In addition, for the LR and          classifiers, for our multi-class experiments we used the one-versus-
SVM classifiers, we extracted the top 20 features as measured by           the-rest approach. Results are shown in Tables 6 – 8.
their weights in the fitted classifier. Note that since Finding                Discussion: The results show that LR is an acceptable classifier
Sentences form only about 8.5% of the dataset, the default classifier      for this problem. The two-class accuracy score of 96.3% (Table 4)
that labels all test instances as “Non-Finding” would have an              is better than that of the default classifier, although in this classifier
accuracy score of 91.5% (under reasonable assumptions about the            as well most of the accuracy score appears to come from the correct
distribution of sentences).                                                predictions of the Non-Finding Sentences. The two-class precision
    Table 4 summarizes the average accuracy for each classifier, for       of 0.84 for Finding Sentences (Table 8) indicates that false
each of the two sets of experiments. We also computed the false            positives are still a concern, though substantially lower than those
positive rates from the confusion matrices. The remainder of this          of the NB classifier. The confusion matrix did not indicate any
section of the paper reports details on each ML classifier.                dominant source of error. The words and phrases (stemmed) in the
                                                                           highest-ranked features were similar to those used in the hand-
                 Multi-        Multi-        Two-             Two-         scripted classifier.
 Algorithm
                  class         class         class           class
 / Metrics
                Accuracy      False-Pos     Accuracy        False-Pos                                   Precision    Recall    F1-score

     NB           81.7%          1.5%         93.4%           2.4%                     Citation           0.99        0.97        0.98
                                                                                       Evidence           0.87        0.94        0.91
     LR           85.7%          1.6%         96.3%           1.2%
                                                                                         Finding          0.81        0.78        0.79
    SVM           85.7%          1.6%         96.8%           1.2%
                                                                                      Legal-Rule          0.88        0.91        0.89
 Table 4. Average Accuracy and False-Positive Rates, Three                            Reasoning           0.66        0.52        0.58
            Classifiers, Two Sets of Experiments
                                                                                         Others           0.70        0.59        0.64
5.1 Naive Bayes (NB)                                                             Table 6. Logistic Regression Summary, Multi-Class
The Scikit-learn Python module has implementations of multiple
variants of the basic NB algorithm. We chose the GaussianNB                          C             E         F         L          R         O
implementation with default parameters to present results
                                                                             C      91.1          0.6       0.0       1.5        0.0        0.3
(implementation of ComplementNB yielded similar results).
Results for the two-class experiment are shown in Table 5.                   E       0.7       226.6        1.3       1.1        9.8        1.1
                                                                             F       0.0          3.1       37.7      1.5        4.5        1.4
                            Precision      Recall     F-1
                                                                             L       0.2          1.9       1.3       85.1       3.4        1.5
              Finding          0.64         0.48      0.54
                                                                             R       0.2        21.4        4.5       4.7       37.4        3.8
           Non-Finding         0.95         0.98      0.96
                                                                             O       0.2          6.2       1.8       3.3        1.7       19.1
   Table 5. Naive Bayes Classification Summary, Two-Class                   Table 7. Logistic Regression Confusion Matrix, Multi-Class
    Discussion: The results show that NB is not a preferable
classifier for this problem. While the overall accuracy for both the                                    Precision    Recall    F-1 Score
multi-class and two-class experiments appear to be acceptable
(Table 4), a closer look indicates these are substantial deficiencies                  Finding            0.84        0.69        0.75
in this classifier, especially for the important two-class case (Table
5). The two-class accuracy score of 93.4% (Table 4) is not a                        Non-Finding           0.97        0.99        0.98
significant improvement over the default classifier (with an
accuracy of 91.5%). The precision of 0.64 for Finding Sentences                   Table 8. Logistic Regression Summary, Two-Class
 Automatic Classification of Rhetorical Roles for Sentences                                ASAIL 2019, June 2019, Montreal, QC, Canada

5.3 Support Vector Machines (SVM)                                        examples of reasoning in similar cases. Given the generic nature of
                                                                         the scripts and the small sample of labeled decisions, there is reason
An SVM is an ML algorithm for binary classification problems. It
                                                                         to think that this methodology is transferable to other areas of law.
is based on finding a maximum margin hyperplane that divides the
                                                                         We plan to test this hypothesis in our future work.
training set into the two classes. Based on the success of the LR
                                                                             For the ML experiments, for each of 10 runs we employed 90%
classifier, we decided to use a linear kernel for the SVM. Since
                                                                         of 5,797 labeled sentences for training, and the other 10% for
SVM classifiers are by default binary, for the multi-class
                                                                         testing. While this quantity of training/testing data was 10 times the
experiment the implementation builds one-versus-one classifiers
                                                                         quantity of data used to construct the hand-crafted scripts, it is still
and a voting scheme is used to predict the label for a test instance.
                                                                         a smaller dataset than those on which ML models are typically
Some results are shown in Tables 9 and 10.
                                                                         based. The LR and SVM classifiers produced similar recall,
    Discussion: The results show that performance of the SVM
                                                                         precision and F1 scores for classifying Finding Sentences, in both
classifier with a linear kernel has similar performance to that of the
                                                                         the multi-class and two-class experiments. Either significantly
LR classifier. This is true for both the multi-class and the two-class
                                                                         outperformed the hand-crafted scripts in these metrics. However,
experiments. However, there is substantial divergence in the top
                                                                         we emphasize that we did not try to optimize the scripts that we
features chosen by the two algorithms. The features in common are
                                                                         tested. Our goal at this stage was to develop and test a methodology
“board find”, “thus” and “whether”. One hypothesis is that most of
                                                                         for writing such scripts, and to determine whether even basic scripts
the top features are used to decide the Non-Finding class labels, and
                                                                         could yield promising results for some use cases. A next step is to
the Finding class arises as a default class. Several of the highest-
                                                                         improve the performance of our scripts in those use cases. One
ranked features seemed to be specific for PTSD cases. Also, as with
                                                                         approach will be to employ part-of-speech tagging of at least
the LR classifier, the confusion matrix for the multi-class SVM did
                                                                         subjects and verbs, which may improve the predictive power of
not indicate any dominant source of classification error.
                                                                         script features by distinguishing between attribution subjects and
                                                                         cues, on the one hand, and attribution objects on the other.
                          Precision    Recall    F1-score                    A second approach will be to use our qualitative methodology
            Citation         0.98        0.96       0.98                 to write and test scripts for the other rhetorical roles. Our results
                                                                         here suggest, for example, that there are promising scripts for
           Evidence          0.88        0.94       0.91
                                                                         excluding many Legal-Rule Sentences from consideration as
            Finding          0.82        0.78        0.8                 Finding Sentences. We think that scripts can be written for
          Legal-Rule         0.90        0.90       0.90                 positively classifying Legal-Rule Sentences. For example, in
                                                                         addition to any lexical features, a Legal-Rule Sentence is generally
          Reasoning          0.65        0.53       0.58
                                                                         followed immediately by a Citation Sentence (or by intervening
           Sentence          0.63        0.63       0.62                 other Legal-Rule Sentences, and then a Citation Sentence).
                                                                         Moreover, Citation Sentences have very particular content and are
      Table 9. SVM Classification Summary, Multi-Class                   highly distinguishable. Attribution theory will also guide script
                                                                         development for classifying Evidence Sentences. Thus, a larger
                           Precision Recall F-1 Score                    qualitative study may lead to better-performing scripts.
                                                                             We also intend to combine high-performing scripts into a
             Finding         0.85       0.74       0.79
                                                                         pipeline that also includes ML or DL (deep-learning) classifiers.
          Non-Finding        0.98       0.99       0.98                  Scripts can add new and legally-significant labels to sentences,
                                                                         which can then provide input features for ML or DL classifiers.
      Table 10. SVM Classification Summary, Two-Class                    Training ML or DL classifiers on data partially annotated by scripts
                                                                         may improve their performance.
6    General Discussion and Future Work
The main hypothesis for our work was that Finding Sentences in           7    Conclusion
legal decisions contain such regular and distinctive phrasing that       We used attribution theory to develop a qualitative methodology
scripts written on a very small sample, as well as ML models             for analyzing a very small sample of labeled sentences to create
trained on larger but still relatively small samples, could perform      rule-based scripts that can classify sentences that state findings of
sufficiently well for many valuable use cases. The results of our        fact (“Finding Sentences”). We compared the results of those
preliminary experiments indicate that this hypothesis was correct,       scripts with the performance of standard ML algorithms trained and
for the reasons we began to discuss in Section 4.3.                      tested on a larger dataset, but one that is still a relatively small
    In the qualitative study, we used attribution theory to identify     dataset by ML standards. Both of these experiments suggest that
possible classification features from a very small set of 5 decisions,   some access-to-justice use cases can be adequately addressed with
and we tested our hypotheses on a larger set of 45 decisions. Using      very small quantities of labeled data, and at much lower cost than
attribution-finding cues and other general concepts, we developed        previously believed.
scripts that performed reasonably well for such use cases as
semantic search and semantic viewer, for the purpose of retrieving
 ASAIL 2019, June 2019, Montreal, QC, Canada                                                                                                       V.R. Walker et al.

                                                                                        [22] S. Pareti. 2011. Annotating Attribution Relations and Their Features. In
ACKNOWLEDGMENTS                                                                              Proceedings of the Fourth Workshop on Exploiting Semantic Annotations in
We thank the Maurice A. Deane School of Law for its support for                              Information Retrieval (ESAIR ’11) (Glasgow, Scotland, UK, October 28, 2011).
                                                                                             ACM, New York.
the Research Laboratory for Law, Logic and Technology.                                  [23] S. Pareti, T. O’Keefe, I. Konstas, J. R. Curran and I. Koprinska. 2013.
                                                                                             Automatically Detecting and Attributing Indirect Quotations. In Proceedings of
REFERENCES                                                                                   the 2013 Conference on Empirical Methods in Natural Language Processing
                                                                                             (Seattle, Washington, October 18-21, 2013), 989-999.
[1] L. Al-Abdulkarim, K. Atkinson and T. Bench-Capon. 2016. Statement Types in          [24] M. Patton. 1990. Qualitative Evaluation and Research Methods. Beverly Hills,
    Legal Argument. In Legal Knowledge and Information Systems (JURIX 2016),                 CA: Sage.
     Bex, F., and Villata, S., eds. IOS Press, 3-12.                                    [25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel and B. Thirion. 2011.
[2] M. P. Allen. 2007. Significant Developments in Veterans Law (2004-2006) and              Scikit-learn: Machine Learning in Python. J. Machine Learning Res.12, 2825-
     What They Reveal about the U.S. Court of Appeals for Veterans Claims and the            2830.
     U.S. Court of Appeals for the Federal Circuit. University of Michigan Journal of   [26] S. E. Robertson and K. Spark Jones. 1976. Relevance Weighting of Search
     Law Reform 40, 483-568. University of Michigan.                                         Terms. J. American Society for Information Science 27(3), 129-146.
[3] K. D. Ashley and V. R. Walker. 2013. Toward Constructing Evidence-Based             [27] M. Saravanan and R. Ravindran. 2010. Identification of rhetorical roles for
     Legal Arguments Using Legal Decision Documents and Machine Learning. In                 segmentation and summarization of a legal judgment. Artificial Intelligence and
     Proceedings of the 14th International Conference on Artificial Intelligence and         Law, 18, 45–76.
     Law (ICAIL 2013). ACM, New York, NY, 176-180.                                      [28] J. Savelka, V. R. Walker, M. Grabmair and K. D. Ashley. 2017. Sentence
[4] K. D. Ashley and V. R. Walker. 2013. From Information Retrieval (IR) to                  Boundary Detection in Adjudicatory Decisions in the United States. Revue TAL,
     Argument Retrieval (AR) for Legal Cases: Report on a Baseline Study. In Legal           58(2), 21-45.
     Knowledge and Information Systems, Ashley, K. D., Ed. IOS Press, 29-38.            [29] O. Shulayeva, A. Siddharthan and A. Wyner. 2017. Recognizing cited facts and
[5] A. Bansal, Z. Bu, B. Mishra, S. Wang, K. Ashley, and M. Grabmai. 2016.                   principles in legal judgements. Artificial Intelligence and Law 25(1), 107-126.
     Document Ranking with Citation Information and Oversampling Sentence               [30] C. Stab and I. Gurevych. 2014. Identifying Argumentative Discourse Structures
     Classification in the LUIMA Framework. In Legal Knowledge and Information               in Persuasive Essays. In Proceedings of the 2014 Conference on Empirical
     Systems (JURIX 2016), Bex, F., and Villata, S., eds. IOS Press, 33-42.                  Methods in Natural Language Processing (EMNLP), 46-56, Doha, Qatar.
[6] Board of Veterans’ Appeals, U.S. Department of Veterans Affairs. 2018. Annual       [31] S. Teufel and M. Moens. 2002. Summarizing scientific articles: experiments with
     Report, Fiscal Year 2018.                                                               relevance and rhetorical status. Computational Linguistics, 28(4), 409-445.
[7] H. Bunt, R. Prasad and A. Joshi. 2012. First steps towards an ISO standard for      [32] H. Wachsmuth, M. Potthast, K. Al-Khatib, Y. Ajjour, J. Puschmann, J. Qu, J.
     annotating discourse relations. In Proceedings of the Joint ISA-7, SRSL-3, and          Dorsch, V. Morari, J. Bevendorff and B. Stein. 2017. Building an Argument
     I2MRT LREC 2012 Workshop on Semantic Annotation and the Integration and                 Search Engine for the Web. In Proceedings of the 4th Workshop on Argument
     Interoperability        of      Multimodal      Resources        and      Tools         Mining, 49-59, Copenhagen, Denmark.
     (Istanbul, Turkey, May 2012), 60-69.                                               [33] V. R. Walker. 2014. Representing the use of rule-based presumptions in legal
[8] C. Cortes and V. Vapnik. 1995. Support-Vector Networks. Machine Learning 20,             decision documents. Law, Probability and Risk, 13(3-4), 259-275. Oxford UP.
     273-297. Kluwer.                                                                   [34] V. R. Walker, P. Bagheri and A. J. Lauria. 2015. Argumentation Mining from
[9] R-E. Fan, K-W. Chang, C-J. Hsieh, X-R. Wang and C-J. Lin. 2008. LIBLINEAR:               Judicial Decisions: The Attribution Problem and the Need for Legal Discourse
     A Library for Large Linear Classification. J. Machine Learning Res. 9, 1871-            Models. Paper at the First Workshop on Automated Detection, Extraction and
     1874.                                                                                   Analysis of Semantic Information in Legal Texts (ASAIL 2015), San Diego,
[10] M. Grabmair, K. D. Ashley, R. Chen, P. Sureshkumar, C. Wang, E. Nyberg and              California, USA. URL: https://people.hofstra.edu/vern_r_walker/WalkerEtAl-
     V. R. Walker. 2015. Introducing LUIMA: An Experiment in Legal Conceptual                AttributionAndLegalDiscourseModels-ASAIL2015.pdf.
     Retrieval of Vaccine Injury Decisions Using a UIMA Type System and Tools. In       [35] V. R. Walker, N. Carie, C. C. DeWitt and E. Lesh. 2011. A framework for the
     Proceedings of the 15th International Conference on Artificial Intelligence &           extraction and modeling of fact-finding reasoning from legal decisions: lessons
     Law (ICAIL 2015), 69-78. ACM, New York.                                                 from the Vaccine/Injury Project Corpus. Artificial Intelligence and Law 19, 291-
[11] C. Grover, B. Hachey, I. Hughson and C. Korycinski. 2003. Automatic                     331.
     Summarization of Legal Documents. In Proceedings of the 9th International          [36] V. R. Walker, D. Foerster, J. M. Ponce and M. Rosen. 2018. Evidence Types,
     Conference on Artificial Intelligence and Law (ICAIL ’03), 243-251. ACM, New            Credibility Factors, and Patterns or Soft Rules for Weighing Conflicting
     York.                                                                                   Evidence: Argument Mining in the Context of Legal Rules Governing Evidence
[12] T. C. Guetterman, T. Chang, M. DeJonckheere, T. Basu, E. Scruggs and V. G.              Assessment. In Proceedings of the 5th Workshop on Argument Mining
     Vinod Vydiswaran. 2018. Augmenting Qualitative Text Analysis with Natural               (ArgMining 2018), 68-78. ACL.
     Language Processing: Methodological Study. J. Med. Internet Res. 20(6), e231.      [37] V. R. Walker, J. H. Han, X. Ni and K. Yoseda. 2017. Semantic Types for
[13] B. Hachey and C. Grover. 2006. Extractive summarization of legal texts.                 Computational Legal Reasoning: Propositional Connectives and Sentence Roles
     Artificial Intelligence and Law 14, 305–345.                                            in the Veterans’ Claims Dataset. In Proceedings of the 16th International
[14] R. Krestel, S. Bergler and R. Witte. 2008. Minding the Source: Automatic                Conference on Artificial Intelligence and Law (ICAIL ’17), 217-226. ACM, New
     Tagging of Reported Speech in Newspaper Articles. In Proceedings of the Sixth           York.
     International Language Resources and Evaluation Conference (LREC ’08)              [38] V. R. Walker, A. Hemendinger, N. Okpara and T. Ahmed. 2017. Semantic Types
     (Marrakech, Morocco, May 28-30, 2008), 2823-2828.                                       for Decomposing Evidence Assessment in Decisions on Veterans’ Disability
[15] J. Lawrence and C. Reed. 2017. Mining Argumentative Structure from Natural              Claims for PTSD. In Proceedings of the Second Workshop on Automatic
     Language Text Using Automatically Generated Premise-Conclusion Topic                    Semantic Analysis of Information in Legal Texts (ASAIL 2017), 10 pages,
     Models. In Proceedings of the 4th Workshop on Argument Mining, 39-48,                   London, UK.
     Copenhagen, Denmark.                                                               [39] B. Waltl, G. Bonczek, E. Scepankova and F. Matthes. 2019. Semantic types of
[16] E. de Maat, K. Krabben and R. Winkels. 2010. Machine Learning versus                    legal norms in German laws: classification and analysis using local linear
     Knowledge Based Classification of Legal Texts. In Proceedings of the 2010               explanations. Artificial Intelligence and Law 27, 43-71. Springer.
     Conference on Legal Knowledge and Information Systems (JURIX 2010), 87-96.         [40] D. Walton. 2009. Argumentation theory: A very short introduction. In Guillermo
[17] L. T. McCarty. 2007. Deep Semantic Interpretations of Legal Texts. In                   Simari and Iyad Rahwan, editors, Argumentation in Artificial Intelligence, 1-22.
     Proceedings of the 11th International Conference on Artificial Intelligence and         Springer, US.
     Law (ICAIL ’07), 217-224. ACM, New York.                                           [41] B. Webber and A. Joshi. 2012. Discourse Structure and Computation: Past,
[18] R. Mochales and M-F. Moens. 2011. Argumentation mining. Artificial                      Present and Future. In Proceedings of the ACL-2012 Special Workshop on
     Intelligence and Law 19, 1-22. Springer.                                                Rediscovering 50 Years of Discoveries (Jeju, Republic of Korea, July 10, 2012),
[19] M-F. Moens, E. Boiy, R. Mochales and C. Reed. 2007. Automatic Detection of              42-54.
     Arguments in Legal Texts. In Proceedings of the 11th International Conference
     on Artificial Intelligence and Law (ICAIL ’07), 225-230. ACM, New York.
[20] V. H. Moshiashwili. 2015. The Downfall of Auer Deference: Veterans Law at
     the Federal Circuit in 2014. American University Law Review 64, 1007-1087.
     American University.
[21] R. M. Palau and M-F Moens. 2009. Argumentation mining: the detection,
     classification and structure of arguments in text. In Proceedings of the 12th
     International Conference on Artificial Intelligence and Law (ICAIL 2009), 98–
     107, Barcelona, Spain.