Automatic Classification of Rhetorical Roles for Sentences: Comparing Rule-Based Scripts with Machine Learning

Vern R. Walker, Director, Research Laboratory for Law, Logic & Technology (LLT Lab), Maurice A. Deane School of Law, Hofstra University, Hempstead, New York, USA. vern.r.walker@hofstra.edu
Krishnan Pillaipakkamnatt, Chair, Department of Computer Science, Fred DeMatteis School of Engineering and Applied Science, Hofstra University, Hempstead, New York, USA. krishnan.pillaipakkamnatt@hofstra.edu
Alexandra M. Davidson, Research Laboratory for Law, Logic & Technology (LLT Lab), Maurice A. Deane School of Law, Hofstra University, Hempstead, New York, USA. lltlab@hofstra.edu
Marysa Linares, Research Laboratory for Law, Logic & Technology (LLT Lab), Maurice A. Deane School of Law, Hofstra University, Hempstead, New York, USA. lltlab@hofstra.edu
Domenick J. Pesce, Research Laboratory for Law, Logic & Technology (LLT Lab), Maurice A. Deane School of Law, Hofstra University, Hempstead, New York, USA. lltlab@hofstra.edu

ABSTRACT

Automatically mining patterns of reasoning from evidence-intensive legal decisions can make legal services more efficient, and it can increase the public's access to justice, through a range of use cases (including semantic viewers, semantic search, decision summarizers, argument recommenders, and reasoning monitors). Important to these use cases is the task of automatically classifying those sentences that state whether the conditions of applicable legal rules have been satisfied or not in a particular legal case. However, insufficient quantities of gold-standard semantic data, and the high cost of generating such data, threaten to undermine the development of such automatic classifiers. This paper tests two hypotheses: whether distinctive phrasing enables the development of automatic classifiers on the basis of a small sample of labeled decisions, with adequate results for some important use cases, and whether semantic attribution theory provides a general methodology for developing such classifiers. The paper reports promising results from using a qualitative methodology to analyze a small sample of classified sentences (N = 530) to develop rule-based scripts that can classify sentences that state findings of fact ("Finding Sentences"). We compare those results with the performance of standard machine learning (ML) algorithms trained and tested on a larger dataset (about 5,800 labeled sentences), which is still relatively small by ML standards. This methodology and these test results suggest that some access-to-justice use cases can be adequately addressed at much lower cost than previously believed. The datasets, the protocols used to define sentence types, the scripts and ML codes will be publicly available.

ACM Reference format:
Vern R. Walker, Krishnan Pillaipakkamnatt, Alexandra M. Davidson, Marysa Linares and Domenick J. Pesce. 2019. Automatic Classification of Rhetorical Roles for Sentences: Comparing Rule-Based Scripts with Machine Learning. In Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), Montreal, QC, Canada, 10 pages.

In: Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada. © 2019 Copyright held by the owner/author(s). Copying permitted for private and academic purposes. Published at http://ceur-ws.org.
1 Introduction

Automated argument mining would add greatly to productivity in legal practice, and it could increase access to justice in many legal areas through a range of use cases. As a first use case, a web-based semantic viewer might automatically highlight for the user those sentences and patterns that are of interest in argumentation. For example, a semantic viewer might open a new decision document and provide filters that the user could select to highlight only the conclusions, or the evidence, or the stated reasoning from evidence to conclusions.

Second, we could create semantic search tools, which would use components of reasoning to search through hundreds of thousands of decisions in plain-text format, retrieve those decisions similar to a new case (e.g., those with similar issues to be proved, and similar types of evidence available for proving them), rank the similar decisions in order of greatest similarity, and then display the portions to read (using a semantic viewer).

Third, the capability of extracting from published decisions the major conclusions, the intermediate reasoning, and the evidentiary basis for a decision would also provide the components of an informative summary of that decision. A decision summarizer could analyze a decision and create a digest or summary for the human reader.

Fourth, an automatic argument miner could extract successful and unsuccessful patterns of reasoning from thousands of past decisions, and then it could generate suggestions for new arguments in new cases. Such an argument recommender could assist attorneys, non-attorneys, and judges in processing new cases.

Fifth, such an automated argument miner could monitor cases as they are being litigated, compare evolving arguments with patterns of reasoning that have been successful or unsuccessful in the past, and detect possible outliers (arguments likely to fail or decisions likely to be incorrect). A reasoning monitor could also maintain statistics and trends for patterns of reasoning, and it could predict probabilities of success for new cases.

Provided the data derived from argument mining is valid and predictive of real-case outcomes, such tools could assist alternative dispute resolution and increase efficiency within the legal system. Such automated, evidence-based tools (semantic viewers, semantic search, decision summarizers, argument recommenders, and reasoning monitors) could also assist non-lawyers when they represent themselves in cases where a lawyer is not available.
Producing such a range of tools, however, faces several challenges. One challenge is whether machine learning (ML) tools can be effective at automating such argument mining. First, there is the problem of the available quantity of gold-standard data for training and testing. Supervised ML may require such a large quantity of accurately labeled documents that there are not sufficient resources to generate it, in all areas of law where argument mining is desirable. While semi-supervised ML, using large quantities of unlabeled data and small quantities of labeled data, offers more promise, even that approach requires trained human classifiers. Especially in legal areas where outcomes may not economically support the hiring of a lawyer (e.g., veterans' disability claims or immigration asylum claims), there may be little financial incentive to create such a great quantity of ground-truth annotated data. Moreover, some areas of law (such as vaccine-injury compensation decisions) may not even produce a sufficient quantity of cases bearing on a particular issue, even if we were to annotate it all.

Moreover, there is the challenge of ensuring data validity. In order to create effective tools, especially tools that predict future outcomes given past decisions, the data upon which the tools are trained must accurately reflect what we believe it measures. But appropriately annotating the components of legal reasoning requires an adequate theory of legal reasoning, a sufficient number of trained annotators, and adequate quality assurance. Also, to inspire trust in ML outputs, the models must be transparent and understandable.

Finally, there is the challenge of developing and testing adequate classification type systems for legal arguments [34, 38]. Unsupervised ML has the challenge of producing useful clusters, especially if the components of legal argument are poorly understood, the components vary depending on the use case, and the components are different for different practitioners. How to classify argument types is therefore a problem that any effort to create gold-standard data must address.

This paper reports on preliminary research that addresses all of these problems. Our main hypothesis is that reports of fact-finding in legal adjudicatory decisions might employ such regular and distinctive phrasing that even rule-based scripts based on a very small sample, as well as ML models trained on larger samples, can perform adequately for many valuable use cases. Our second hypothesis is that attribution theory from linguistics can be extended to argument mining, as a method for creating semantic types and automatically identifying them in legal decisions. There is reason to think that such an approach will be transferable to adjudicatory decisions in many substantive areas of law.

Using an annotated dataset of U.S. decisions as the gold standard, we investigated a methodology for qualitatively studying a very small sub-sample of such decisions, developing rule-based scripts, and quantitatively testing the script performance. We compared those outcomes against the performance of standard supervised ML models trained on larger samples from the same dataset. Our study addresses data quantity by employing very small datasets; it addresses data validity by employing quality-assurance protocols and publishing those protocols and the resulting data; and it addresses annotation type systems by explaining their derivation. The paper reports promising results relative to important use cases, and it lays out a methodology that should be transferable to many areas of law at a relatively small cost, thus helping to improve access to justice. We make publicly available the annotated dataset, the quality-assurance protocols, the scripts and the ML settings, at https://github.com/LLTLab/VetClaims-JSON.

After we discuss prior related work in Section 2, we describe our dataset and how we used it in both our script and ML experiments (Section 3). Section 4 describes the qualitative-study experiments and their results, while Section 5 describes the ML experiments and their results. Section 6 contains our general discussion of these combined results and our future work.
2 Prior Related Work

The context for the work reported in this paper is the goal of automated argument mining from adjudicatory legal decisions. Such argument mining would automatically extract the evidence assessment and fact-finding reasoning found in adjudicatory decisions, for the purpose of identifying successful and unsuccessful units of evidentiary argument. Researchers generally identify an argument as containing a conclusion or claim, together with a set of one or more premises [21, 40, 30, 15, 32].

A first level of analysis is to classify the rhetorical roles of sentences for argument mining, that is, assigning sentences roles as either premise or conclusion. Prior work on classifying such rhetorical roles in adjudicatory decisions includes: applying machine learning to annotate sentence types in vaccine-injury compensation decisions [3, 4, 5, 10]; assigning rhetorical roles to sentences in Indian court decisions [27]; classifying sentences as argumentative in the Araucaria corpus, including newspapers and court reports [19]; automatically summarizing legal judgments of the U.K.'s House of Lords, in part by classifying the rhetorical status of sentences [13]; annotating sentences as containing legal principles or facts in common-law reports [29]; and using statistical parsing as the input for computing quasi-logical forms as deep semantic interpretations of sentences in U.S. appellate court decisions [17]. Al-Abdulkarim et al. provide one overview of statement types involved in legal reasoning in cases, from evidence to a verdict [1]. The approach we describe in this paper utilizes a type system of rhetorical roles developed to annotate any fact-finding decision, and we compare script and ML classifiers for the rhetorical roles of sentences. Moreover, it is not common for datasets to be publicly available, together with protocols for data generation, scripts and codes, to enable confirmation of data accuracy and replication of results.

A recent article compared two experiments in automated classification of legal norms from German statutes, with regard to their semantic type: (1) a rule-based approach using hand-crafted pattern definitions, and (2) an ML approach [39]. (For similar work on Dutch laws, see [16].) The performance metrics for the two experiments were comparable on a dataset of manually-labeled statements. While this study is highly relevant to our work, there are distinct differences.
We develop a qualitative methodology for developing classification features of sentences in adjudicatory decisions (not statutes), according to their rhetorical role (not norm type), for the purpose of automated argument mining. Our methodology is general, and it should be transferable to adjudicatory decisions in any substantive area of law.

To identify the rhetorical roles of sentences, we employ an extension of the semantic theory of attribution analysis. Attribution, in the context of argument mining, is the descriptive task of determining which actor is asserting, assuming or relying upon which propositions, in the course of presenting reasoning or argument. Although attribution is a classic problem area in natural language processing generally [7, 14, 22, 23], there has been limited work on attribution in respect to argument mining from legal documents. Grover et al. reported on a project to annotate sentences in House of Lords judgments for their argumentative roles [11]. Two tasks were to attribute statements to the Law Lord speaking about the case or to someone else (attribution), and to classify sentences as formulating the law objectively vs. assessing the law as favoring a conclusion or not favoring it (comparison). This work extended the work of [31] on attribution in scientific articles. A broader discussion of attribution within the context of legal decisions is found in [34]. Unlike the adjudicatory decisions used in our study, the House of Lords judgments studied by [11] treated facts as already settled in the lower courts. Our study appears to be unique in using attribution analysis to help classify the rhetorical roles of sentences in the evidence assessment portions of adjudicatory texts.

We have also developed classification protocols (classification criteria and methods) for each rhetorical role. We use protocols to give precise meaning to the semantic type, to train new annotators, and to review the accuracy of human annotations. We also use such protocols to guide the development of the features or rule-based scripts for automatically classifying legal texts (e.g., [28]). Stab and Gurevych have classified such features into 5 groups [30]. For example, the main verb of a finding sentence tends to be in present tense, while the main verbs of evidence sentences tend to be in past tense. Features derived from the protocols can drive the application of high-precision / low-recall techniques of the kind used successfully by [15], which we argue is the performance desired for certain use cases but not others.
3 The Datasets

We developed a common dataset to use in comparing the classification performance of rule-based script classifiers with the performance of ML models. This section describes that dataset and how it was used.

3.1 The BVA PTSD Dataset

We analyzed 50 fact-finding decisions issued by the U.S. Board of Veterans' Appeals ("BVA") from 2013 through 2017 (the "PTSD dataset"). We arbitrarily selected those decisions from adjudicated disability claims by veterans for service-related post-traumatic stress disorder (PTSD). PTSD is a mental health problem that some people develop after experiencing or witnessing a traumatic event, such as combat or sexual assault. Individual claims for compensation for a disability usually originate at a Regional Office ("RO") of the U.S. Department of Veterans Affairs ("VA"), or at another local office across the country [2, 20]. If the claimant is dissatisfied with the decision of the RO, she may file an appeal to the BVA, which is an administrative appellate body that has the authority to decide the facts of each case based on the evidence [20]. The BVA must provide a written statement of the reasons or bases for its findings and conclusions, and that statement "must account for the evidence which [the BVA] finds to be persuasive or unpersuasive, analyze the credibility and probative value of all material evidence submitted by and on behalf of a claimant, and provide the reasons for its rejection of any such evidence." Caluza v. Brown, 7 Vet. App. 498, 506 (1995), aff'd, 78 F.3d 604 (Fed. Cir. 1996).

The veteran may appeal the BVA's decision to the U.S. Court of Appeals for Veterans Claims (the "Veterans Court") [20], but the standard of review for issues of fact is very deferential to the BVA. In order to set aside a finding of fact by the BVA, the Veterans Court must determine it to be "clearly erroneous." [20] And although either the claimant or the VA may appeal a Veterans Court decision to the U.S. Court of Appeals for the Federal Circuit, the Federal Circuit may only review questions of law, such as a constitutional challenge, or the interpretation of a statute or regulation relied upon by the Veterans Court. [2, 20] Except for constitutional issues, it "may not review any 'challenge to a factual determination' or any 'challenge to a law or regulation as applied to the facts of a particular case.'" Kalin v. Nicholson, 172 Fed. Appx. 1000, 1002 (Fed. Cir. 2006). Thus, the findings of fact made by the BVA are critical to the success or failure of a veteran's claim.

The BVA's workload has increased dramatically in the past decade, reaching 85,288 decisions in fiscal year 2018. [6, p. 32] The vast majority of appeals (96%) considered by the BVA involve claims for compensation. [6, p. 31] Therefore, identifying the patterns of factual reasoning within the decisions of the BVA presents a significant challenge for automated argument mining.

For each of the 50 BVA decisions in our PTSD dataset, we extracted all sentences addressing the factual issues related to the claim for PTSD, or for a closely-related psychiatric disorder. This set of sentences ("PTSD-Sent") is the dataset on which we conducted our experiments. The "Reasons and Bases" section of the decision is the longest section, containing the Board's statement of the evidence, its evaluation of that evidence, and its findings of fact on the relevant legal issues.
3.1.1 Rhetorical Roles of Sentences in the PTSD-Sent Dataset

For the purpose of identifying reasoning or argument patterns, we focus primarily on sentences that play one of three rhetorical roles in evidence assessment: the finding of fact, which states whether a propositional condition of a legal rule is determined to be true, false or undecided; the evidence in the legal record on which the findings rest, such as the testimony of a lay witness or a medical record; and the reasoning from the evidence to the findings of fact. Identifying the sentences that have those roles within adjudicatory decisions, however, presents special problems. Such decisions have a wide diversity of roles for sentences: e.g., stating the legal rules, policies and principles applicable to the decision, as well as providing citations to authority; stating the procedural history of the case, and the rulings on procedural issues; summarizing the evidence presented and the arguments of the parties based on that evidence; and stating and explaining the tribunal's findings of fact based on that evidence. [37] Thus, BVA decisions pose the challenge of classifying rhetorically important types of sentence and distinguishing them from other types of sentence.

The following are the 5 rhetorical roles that we used to classify sentences in the PTSD-Sent dataset. Sentences were classified manually by teams of 2 trained law students, and they were curated by a law professor with expertise in legal reasoning. Data validity is open to scrutiny because the data will be publicly available.

Finding Sentence. A Finding Sentence is a sentence that primarily states a "finding of fact", an authoritative conclusion of the trier of fact about whether a condition of a legal rule has been satisfied or not, given the evidence in the case. An example of a Finding Sentence is: "The most probative evidence fails to link the Veteran's claimed acquired psychiatric disorder, including PTSD, to active service or to his service-connected residuals of frostbite." (BVA1340434) [Footnote 1: We cite decisions by their BVA citation number, e.g., "BVA1302544". Decisions are available from the VA website: https://www.index.va.gov/search/va/bva.jsp.]

Evidence Sentence. An Evidence Sentence primarily states the content of the testimony of a witness, states the content of documents introduced into evidence, or describes other evidence. Evidence sentences provide part of the premises for findings of fact. An example of an Evidence Sentence is: "The examiner who conducted the February 2008 VA mental disorders examination opined that the Veteran clearly had a preexisting psychiatric disability when he entered service." (BVA1303141)

Reasoning Sentence. A Reasoning Sentence primarily reports the trier of fact's reasoning underlying the findings of fact (therefore, a premise). Such reasoning often involves an assessment of the credibility and probative value of the evidence. An example of a Reasoning Sentence is: "Also, the clinician's etiological opinions are credible based on their internal consistency and her duty to provide truthful opinions." (BVA1340434)

A unit of argument or reasoning within evidence assessment is usually composed of these three types of sentence (finding, evidence, and reasoning). The "Reasons and Bases" section of a BVA decision generally also includes two other types of sentence (those stating legal rules and citations), which must be distinguished from the first three. Unlike the case-specific elements of evidence, reasoning and findings, the legal rules and citations are often the same for tens of thousands of cases, even though the sentences stating those rules and citations can be highly variable linguistically, depending upon the writing style of the judge.

Legal-Rule Sentence. A Legal-Rule Sentence primarily states one or more legal rules in the abstract, without stating whether the conditions of the rule(s) are satisfied in the case being decided. An example of a Legal-Rule Sentence is: "Establishing direct service connection generally requires medical or, in certain circumstances, lay evidence of (1) a current disability; (2) an in-service incurrence or aggravation of a disease or injury; and (3) a nexus between the claimed in-service disease or injury and the present disability." (BVA1340434)

Citation Sentence. A Citation Sentence references legal authorities or other materials, and usually contains standard notation that encodes useful information about the cited source. An example is: "See Dalton v. Nicholson, 21 Vet. App. 23, 38 (2007); Caluza v. Brown, 7 Vet. App. 498, 511 (1995), aff'd per curiam, 78 F.3d 604 (Fed. Cir. 1996)." (BVA1340434)

The frequencies of sentence rhetorical types within the PTSD-Sent dataset are shown in Table 1.

  Rhetorical Type        Frequency
  Finding Sentence           490
  Evidence Sentence        2,419
  Reasoning Sentence         710
  Legal-Rule Sentence        938
  Citation Sentence        1,118
  Other Sentences            478
  Total                    6,153

Table 1. Frequency of Sentences in PTSD-Sent Dataset, by Rhetorical Type

For each rhetorical role, a protocol provides a detailed definition of the role, as well as methods and criteria for manually classifying sentences, and illustrative examples. Such protocols furnish materials not only for training annotators and for conducting quality assurance of data validity, but also for developing rule-based scripts that help automate the classification process. In this paper, we use initial caps in referring to a specific semantic type that is defined by a protocol (e.g., "Finding Sentence"), in contrast to a reference to a corresponding general concept (e.g., a finding sentence). The protocols for these five rhetorical roles will be made publicly available, along with the PTSD-Sent dataset.
3.1.2 "Finding Sentences" as Critical Connectors

"Finding Sentences" (as defined in Section 3.1.1 above) are critical connectors in argument mining. They connect the relevant evidence and related reasoning (which function as premises) to the appropriate legal issue, and they state whether a proponent's proof has been successful or not (the conclusion of the reasoning). Our experiments test the automatic classification of Finding Sentences, as distinct from the other sentence roles.

The governing substantive legal rules state the factual issues to be proved, that is, the conditions under which the BVA is required to order compensation, or the BVA is prohibited from ordering compensation. A legal rule can be represented as a set of propositions, one of which is the conclusion and the remaining propositions being the rule conditions [35, 18]. Each condition can in turn function as a conclusion, with its own conditions nested within it [37]. The resulting set of nested conditions has a tree structure, with the entire representation of the applicable legal rules being called a "rule tree" [35]. A rule tree integrates all the governing rules from statutes, regulations, and case law into a single, computable system of legal rules.

Figure 1 shows the highest levels of the rule tree for proving that a veteran's PTSD is "service-connected", and therefore eligible for compensation. As shown in Figure 1, there are three main rule conditions that a veteran must prove (connected to the ultimate conclusion at the top by the logical connective "AND"), and within each branch there are specific conditions if the claim is for PTSD (connected to the branch by "OR", indicating that alternative disabilities may have their own particular rules). In a BVA decision on such a disability claim, therefore, we expect the fact-finding reasoning to be organized around arguments and reasoning on these three PTSD rule conditions. Therefore, the rule tree governing a legal adjudication (such as a BVA case) provides the issues to be proved, and an organization structure for classifying arguments or reasoning based on the evidence. The critical connectors between the rule conditions of the rule tree and the evidence in a specific case are the Finding Sentences.

The veteran has a disability that is "service-connected".
  AND [1 of 3] The veteran has "a present disability".
    OR [1 of …] The veteran has "a present disability" of posttraumatic stress disorder (PTSD), supported by "medical evidence diagnosing the condition in accordance with [38 C.F.R.] § 4.125(a)".
    OR [2 of …] …
  AND [2 of 3] The veteran incurred "a particular injury or disease … coincident with service in the Armed Forces, or if preexisting such service, [it] was aggravated therein".
    OR [1 of …] The veteran's disability claim is for service connection of posttraumatic stress disorder (PTSD), and there is "credible supporting evidence that the claimed in-service stressor occurred".
    OR [2 of …] …
  AND [3 of 3] There is "a causal relationship ["nexus"] between the present disability and the disease or injury incurred or aggravated during service".
    OR [1 of …] The veteran's disability claim is for service connection of posttraumatic stress disorder (PTSD), and there is "a link, established by medical evidence, between current symptoms and an in-service stressor".
    OR [2 of …] …

Figure 1. High-Level Rule Tree for Proving a Service-Connected Disability, and Specifically PTSD.
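Because the rule tree supplies the organizing structure for classified arguments, a small recursive data structure is enough to represent it. The sketch below is a minimal illustration in Python; the class name, its fields, and the abbreviated condition texts are our own assumptions, not the published rule-tree format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RuleNode:
    """One proposition in a rule tree; children are its nested conditions."""
    text: str
    connective: str = "LEAF"   # "AND" or "OR" over the children, if any
    children: List["RuleNode"] = field(default_factory=list)

# Top of the Figure 1 tree: three conjunctive conditions, each with
# PTSD-specific disjunctive branches (abbreviated here).
service_connection = RuleNode(
    text='The veteran has a disability that is "service-connected".',
    connective="AND",
    children=[
        RuleNode('The veteran has "a present disability".', "OR",
                 [RuleNode("PTSD diagnosed per 38 C.F.R. 4.125(a).")]),
        RuleNode("Injury or disease incurred coincident with service.", "OR",
                 [RuleNode("Credible supporting evidence of the claimed in-service stressor.")]),
        RuleNode('A causal "nexus" between the present disability and service.', "OR",
                 [RuleNode("Medical-evidence link between current symptoms and an in-service stressor.")]),
    ],
)

def count_conditions(node: RuleNode) -> int:
    """Count every condition that a Finding Sentence could resolve."""
    return 1 + sum(count_conditions(c) for c in node.children)
```

Traversing such a tree yields the set of rule conditions that the Finding Sentences in a decision can resolve, which is how the rule tree organizes arguments and reasoning based on the evidence.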
They conditions that a veteran must prove (connected to the ultimate recommend that researchers can stop adding to the observation conclusion at the top by the logical connective “AND”), and within sample once that sample has reached reasonable “saturation,” such each branch there are specific conditions if the claim is for PTSD that it is sufficiently information-rich and adding more members (connected to the branch by “OR”, indicating that alternative would be redundant. [24, 12] In the present study, rather than disabilities may have their own particular rules). In a BVA decision devising a metric for saturation, we decided to test our main on such a disability claim, therefore, we expect the fact-finding hypothesis by randomly drawing a very small sample of 5 reasoning to be organized around arguments and reasoning on these decisions, analyzing the 58 sentences labeled as Finding Sentences three PTSD rule conditions. Therefore, the rule tree governing a in those decisions, forming hypotheses about predictive legal adjudication (such as a BVA case) provides the issues to be classification features, and testing the predictive power of those proved, and an organization structure for classifying arguments or features. reasoning based on the evidence. The critical connectors between The qualitative-study test sample (“QS-TS”) consists of the the rule conditions of the rule tree and the evidence in a specific remaining 45 BVA decisions from the PTSD dataset, excluding the case are the Finding Sentences. 5 decisions we used to create the QS-OS dataset. As we formulated hypotheses about the classifying power of linguistic features based 3.2 The Qualitative Study Datasets on the QS-OS, we tested those features quantitatively against the From the common dataset of 50 BVA decisions we randomly drew QS-TS. Within these 45 decisions, we isolated only the evidence a set of 5 decisions to function as the qualitative-study assessment portions of the decisions, the extended section under the observation sample (“QS-OS”). The QS-OS is the sample of heading “Reasons and Bases” for the findings. We call this set of labeled sentences that we studied qualitatively to hypothesize labeled sentences the “QS-TS-R&B”. This dataset contains 5,422 classification features for rhetorical roles. The QS-OS dataset sentences, with the following frequencies for particular sentence contains 530 sentences, with the following frequencies for roles: Finding Sentences = 358, Evidence Sentences = 2,218, particular sentence roles: Finding Sentences = 58, Evidence Reasoning Sentences = 669, Legal-Rule Sentences = 857, Citation Sentences = 201, Reasoning Sentences = 40, Legal-Rule Sentences Sentences = 1,015, and other Sentences = 305. We used QS-TS- = 81, Citation Sentences = 103, and other Sentences = 47. R&B to test our observation-based hypotheses about predictive linguistic features. ASAIL 2019, June 2019, Montreal, QC, Canada V.R. Walker et al. 3.3 The Machine Learning Dataset Board); and (C) the attribution object, or the propositional content that we attribute to the attribution subject, expressed in normal form For our ML experiments, we started with the entire PTSD-Sent by an embedded clause (in the example, the veteran currently has dataset and performed certain preprocessing. We removed PTSD). We distinguish the attribution cues and attribution subjects, sentences that are merely headings, as well as numeric strings in on the one hand, from the proposition being attributed. We call the the data. 
3.3 The Machine Learning Dataset

For our ML experiments, we started with the entire PTSD-Sent dataset and performed certain preprocessing. We removed sentences that are merely headings, as well as numeric strings in the data. All words that remained were stemmed using NLTK's Snowball stemmer. Since punctuation symbols such as hyphens appear to interfere with the stemmer, we filtered out all non-alphabetic characters prior to the stemming step. If the filtering and stemming processes reduced a sentence to only blank characters, the entire sentence was dropped. Importantly, English stop words were not eliminated. Considering that each instance is a relatively short English sentence, eliminating any words might increase the classification error rate.

This preprocessing stage reduced the total dataset to 5,797 usable labeled sentences. The frequencies of sentence types after preprocessing were: Finding Sentences = 490, Evidence Sentences = 2,419, Reasoning Sentences = 710, Legal-Rule Sentences = 938, Citation Sentences = 899, and other Sentences = 341.

The features chosen for the machine learning algorithms were the individual tokens in all the sentences (3,476), and the bigrams (30,959) and trigrams (59,373) that appear in them. These features also form the vocabulary for the vectorizer. We used the CountVectorizer class of the Scikit-learn machine learning library [25] as the feature extractor. The size of the vector was equal to the vocabulary size (93,808). On average, each sentence had about 60 true entries.
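A minimal sketch of this preprocessing and feature-extraction pipeline, assuming the labeled sentences are already available as Python strings. The Snowball stemmer and the unigram-through-trigram CountVectorizer follow the description above; the helper names and the exact filtering expression are our own assumptions.

```python
import re
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer("english")

def preprocess(sentence: str) -> str:
    """Drop non-alphabetic characters, then stem each token; stop words are kept."""
    tokens = re.sub(r"[^A-Za-z\s]", " ", sentence).split()
    return " ".join(stemmer.stem(tok) for tok in tokens)

def prepare(sentences, labels):
    """Clean the sentences, drop any reduced to blanks, and vectorize the rest."""
    cleaned = [(preprocess(s), y) for s, y in zip(sentences, labels)]
    cleaned = [(s, y) for s, y in cleaned if s.strip()]
    texts, kept_labels = zip(*cleaned)
    # Unigrams, bigrams, and trigrams over the stemmed tokens.
    vectorizer = CountVectorizer(ngram_range=(1, 3))
    X = vectorizer.fit_transform(texts)
    return X, list(kept_labels), vectorizer
```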
4 Results of the Qualitative Study

This section describes the experiments we conducted in the qualitative study, as well as the results of those experiments. As we discussed in Section 3.2, the qualitative study was designed to test our main hypothesis that we can use a very small observational sample (only 5 decisions, containing 530 labeled sentences) to develop classifying scripts that perform reasonably well against the remainder of the PTSD dataset (a test dataset of 45 decisions, containing 5,422 labeled sentences), at least for purposes of some use cases. We also use the qualitative study to test our second hypothesis that attribution theory provides a general and transferable method for creating semantic types and linguistic features.

4.1 The Qualitative Study Methodology

In order to develop a systematic methodology for discovering linguistic features that might classify Finding Sentences, we used attribution theory. An example of a sentence explicitly stating an attribution relation is: The Board finds that the veteran currently has PTSD. In interpreting the meaning of this sentence, we attribute to "the Board" the conclusion that "the veteran currently has PTSD". As illustrated in this example, attribution relations have at least three elements or predicate arguments [22, 41]: (A) the attribution cue that signals an attribution, and which provides the lexical grounds for making the attribution (in the example, finds that); (B) the attribution subject, or the actor to which we attribute the propositional content of the sentence (in the example, the Board); and (C) the attribution object, or the propositional content that we attribute to the attribution subject, expressed in normal form by an embedded clause (in the example, the veteran currently has PTSD). We distinguish the attribution cues and attribution subjects, on the one hand, from the proposition being attributed. We call the former "finding-attribution cues" because a lawyer uses them to determine whether a sentence states a finding of fact or not, regardless of which legal-rule condition might be at issue. The proposition being attributed, on the other hand, is the content of the finding. In the example above, the finding-attribution cues are "The Board finds that", while the attribution object is the proposition "the veteran currently has PTSD." An important reason for separating these two categories and testing their performance independently is that finding-attribution cues are more likely to be transferable to disabilities other than PTSD, and they are more likely to have counterparts even in other areas of law.

4.2 Experiments with Finding-Attribution Cues

We conducted a qualitative study of the finding-attribution cues that occur within QS-OS, and ran various experiments to determine how scripts built on those cues would perform against QS-TS-R&B. This section reports the results of several of those experiments, with the results tabulated in Table 2.

4.2.1 Experiments E1 and E1N

It appeared from the QS-OS that a highly-predictive single word might be "finds". Although in this experiment we did not perform part-of-speech tagging, the word "finds" is generally used as a main verb (present tense, singular) when the Board states a finding. This is contrasted with Evidence Sentences, in which the verb is generally in the past tense (e.g., "found"), and the sentence attributes a proposition to a witness or document in the evidentiary record. We also observed occurrences of "concludes" and "grants" used in the same way as "finds". We ran these three alternatives as a single experiment, using the regular expression (finds | concludes | grants), with the results shown as E1 in Table 2.

As shown in Table 2, a common mis-classification in E1 was with Legal-Rule Sentences. In Section 4.3 below, we discuss why precision is important for our use cases. By examining the Legal-Rule Sentences in QS-OS, we noted that, consistent with our main hypothesis, certain types of words and phrases occur in those sentences that we use to attribute them to legal authorities as sources of general legal rules. Such words and phrases include indefinite noun phrases (such as "a veteran," as contrasted with "the Veteran"), conditional terms (such as "if" and "when"), and words typically used as cues for attributing propositions to higher courts (such as "held that" or "ruled that"). We tested scripts that used such words or phrases to exclude Legal-Rule Sentences from the results of E1, with the results shown in Table 2 for E1N.
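A minimal sketch of scripts in the spirit of E1 and E1N, built on the finding-attribution cues and the Legal-Rule exclusion cues just described. The word lists and regular expressions here are illustrative and simplified, not the exact scripts tested in these experiments.

```python
import re

# E1: finding-attribution cues observed in the QS-OS sample.
FINDING_CUES = re.compile(r"\b(finds|concludes|grants)\b", re.IGNORECASE)

# E1N: cues suggesting a Legal-Rule Sentence (indefinite, conditional,
# or higher-court attribution phrasing); illustrative, not exhaustive.
LEGAL_RULE_CUES = re.compile(
    r"\b(a veteran|if|when|held that|ruled that)\b", re.IGNORECASE)

def classify_e1(sentence: str) -> bool:
    """E1: predict 'Finding Sentence' when a finding cue is present."""
    return bool(FINDING_CUES.search(sentence))

def classify_e1n(sentence: str) -> bool:
    """E1N: as E1, but exclude sentences that look like legal rules."""
    return classify_e1(sentence) and not LEGAL_RULE_CUES.search(sentence)

# classify_e1n("The Board finds that the veteran currently has PTSD.")  -> True
# classify_e1n("A veteran is entitled to compensation if ...")          -> False
```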
4.2.2 Experiments E2 and E2N

A primary strength of a qualitative study is being able to identify a phrase that might be highly predictive of Finding Sentences due to the legal meaning of the phrase. One such phrase is "preponderance of the evidence", which is used to formulate the legal standard for finding a proposition to be a fact. An alternative phrase that is often used when assessing what the total evidence proves is "weight of the evidence". We ran scripts using these two alternatives against QS-TS-R&B, with the results shown in Table 2 as E2.

               E1     E1N    E2     E2N    E1+2   E1N+2N
  Finding      129    129    46     43     159    156
  Evidence     3      3      0      0      3      3
  Reasoning    67     66     2      2      69     68
  Legal-Rule   14     10     18     2      30     12
  Citation     0      0      0      0      0      0
  Other        1      1      1      1      2      2
  Recall       0.360  0.360  0.128  0.120  0.444  0.436
  Precision    0.603  0.617  0.687  0.896  0.605  0.647
  F1           0.450  0.455  0.216  0.212  0.512  0.521

Table 2. Qualitative Study Test Results (Frequencies) for Finding-Attribution Cues, by Sentence Rhetorical Role

As with experiment E1 above, the mis-classified Legal-Rule Sentences had the undesirable effect of lowering the precision of the script. By examining the Legal-Rule Sentences in QS-OS, we hypothesized that modal words or phrases, in addition to those indefinite, conditional and attributional words and phrases discussed in Section 4.2.1, could be used to exclude Legal-Rule Sentences. Examples of such modal phrases are "must determine" and "are not necessary." Scripts including these four types of words produced the results shown in Table 2 for E2N.

4.2.3 Experiments E1+2 and E1N+2N

In order to test a combination of scripts, we ran a script that classified a sentence as a Finding Sentence if either E1 so classified it or E2 did so. The results are shown as E1+2 in Table 2. We also ran a combined experiment, including the Legal-Rule Sentence exclusion scripts from E1 (E1N) and from E2 (E2N), with the results shown as E1N+2N in Table 2.

4.3 Discussion of the Qualitative Study

We emphasize that we had a very limited objective in these experiments: to test, in a preliminary way, whether we could use attribution theory to develop hand-crafted, rule-based scripts that could perform adequately in a variety of important use cases. If we could observe useful linguistic patterns in only 5 decisions, we might be able to develop a general methodology that would be transferable to adjudicatory decisions in many areas of law.

We also stress that whether performance is adequate is a function of the end use case. For example, if the use case is to retrieve similar cases and to highlight sentences by rhetorical type for the purpose of suggesting how similar evidence has been argued in past cases, then the priority might be on precision over recall. This is because wasting the user's time with non-responsive returns might have a more serious cost than merely failing to retrieve all similar cases. For such a use case, even recall = 0.436 (for E1N+2N, Table 2) might be useful, because nearly half of all Finding Sentences were correctly identified (true positives).

In addition, precision = 0.647 (for E1N+2N, Table 2) might be acceptable, because the false positives (sentences incorrectly classified as Finding Sentences) constituted only about 1/3 of the predicted sentences. Moreover, the largest number of mis-classified sentences occurred in Reasoning Sentences (68). This may be because a judge might use a main verb such as "finds" when reporting the Board's intermediate reasoning about the credibility or persuasiveness of individual items of evidence. Of the incorrectly classified sentences, about 80% were Reasoning Sentences, which are probably also instructive to a user who is looking for examples of arguments about evidence. For such a use case, a user might learn as much or more from reviewing a Reasoning Sentence as from reviewing a Finding Sentence, and confusion between these two rhetorical roles is less important. For these use cases (semantic search and semantic viewer), the performance of even these simple scripts could be very useful.

Contrast such use cases with a use case that calculates a probability of success for an argument pattern, based on historic results in decided cases. For such a use case, the validity of the probability would depend critically upon relative frequency in the database, and on high recall and precision of similar arguments from past cases. Retrieving every similar case would be a priority, with a potentially significant cost of error, e.g., reliance on an erroneous probability in deciding whether to bring or settle a new legal case. Moreover, confusion between Finding Sentences (which record whether an argument was successful or not) and any other rhetorical type of sentence could have significant consequences.

Because we based the script development for these experiments on attribution theory, as well as on general concepts used to increase precision, we expect this methodology to be transferable to other legal areas besides veterans' disability claims.
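As a worked check on the Table 2 metrics, the E1N+2N column follows directly from the raw frequencies: 156 of the 358 Finding Sentences in QS-TS-R&B were retrieved, along with 85 sentences of other rhetorical types.

```python
# E1N+2N column of Table 2.
tp = 156                       # Finding Sentences correctly retrieved
fp = 3 + 68 + 12 + 0 + 2       # Evidence, Reasoning, Legal-Rule, Citation, Other retrieved
total_finding = 358            # Finding Sentences in QS-TS-R&B (Section 3.2)

recall = tp / total_finding                          # 0.436
precision = tp / (tp + fp)                           # 0.647
f1 = 2 * precision * recall / (precision + recall)   # 0.521
print(round(recall, 3), round(precision, 3), round(f1, 3))
```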
If we were determined by each classifier as being significant to the could observe useful linguistic patterns in only 5 decisions, we prediction of Finding Sentences. The algorithms we chose for this might be able to develop a general methodology that would be study are Naive Bayes (NB), Logistic Regression (LR), and support transferable to adjudicatory decisions in many areas of law. vector machines (SVM) with a linear kernel [26, 9, 8]. We also stress that whether performance is adequate is a We ran each ML algorithm 10 times, each run using a randomly function of the end use case. For example, if the use case is to chosen training subset that contained 90% of the labeled sentences. retrieve similar cases and to highlight sentences by rhetorical type The trained classifier was used to predict the labels for the for the purpose of suggesting how similar evidence has been argued remaining 10% of sentences. All results shown in this section are in past cases, then the priority might be on precision over recall. the averages over these 10 runs. This is because wasting the user’s time with non-responsive returns For each ML algorithm, we ran two sets of experiments. In the might have a more serious cost than merely failing to retrieve all first set of experiments (the “multi-class” experiments) we retained similar cases. For such a use case, even recall = 0.436 (for E1N+2N, the labels for all 5 sentence types in the PTSD-Sent dataset – i.e., Table 2) might be useful because nearly half of all Finding each classifier was fit to a multi-class training set. We recorded the Sentences were correctly identified (true positives). overall accuracy score (the fraction of correctly labeled test instances), the classification summary, and the confusion matrix for ASAIL 2019, June 2019, Montreal, QC, Canada V.R. Walker et al. each algorithm and each run. The classification summary records indicates that the classifier is likely to generate a number of false the precision, recall and F1-score for each label. A confusion matrix positives. The underlying issue is likely to be the strong assumption cell-value C[i][j] is the number of test sentences that are known to of conditional independence between the features. Finally, the be in class i (row i) but are predicted by the classifier to be in class inability of this model to indicate which features were most j (column j). All values shown are averages over the 10 runs. important in making the determination of Finding Sentences makes In the second set of experiments (the “two-class” experiment), it an opaque classifier. we labeled all sentences other than Finding Sentences as “Non- Finding” sentences, so the training and test datasets then contained 5.2 Logistic Regression (LR) only two classes. As before, we recorded the accuracy scores, the The LR algorithm produces a binary classifier, also known as a log- classification summaries, and the confusion matrices and averaged linear classifier. Since the LR algorithm produces only binary them over the runs of each algorithm. In addition, for the LR and classifiers, for our multi-class experiments we used the one-versus- SVM classifiers, we extracted the top 20 features as measured by the-rest approach. Results are shown in Tables 6 – 8. their weights in the fitted classifier. Note that since Finding Discussion: The results show that LR is an acceptable classifier Sentences form only about 8.5% of the dataset, the default classifier for this problem. 
5.1 Naive Bayes (NB)

The Scikit-learn Python module has implementations of multiple variants of the basic NB algorithm. We chose the GaussianNB implementation with default parameters to present results (the ComplementNB implementation yielded similar results). Results for the two-class experiment are shown in Table 5.

               Precision   Recall   F1-score
  Finding      0.64        0.48     0.54
  Non-Finding  0.95        0.98     0.96

Table 5. Naive Bayes Classification Summary, Two-Class

Discussion: The results show that NB is not a preferable classifier for this problem. While the overall accuracy scores for both the multi-class and two-class experiments appear to be acceptable (Table 4), a closer look reveals substantial deficiencies in this classifier, especially for the important two-class case (Table 5). The two-class accuracy score of 93.4% (Table 4) is not a significant improvement over the default classifier (with an accuracy of 91.5%). The precision of 0.64 for Finding Sentences indicates that the classifier is likely to generate a number of false positives. The underlying issue is likely to be the strong assumption of conditional independence between the features. Finally, the inability of this model to indicate which features were most important in making the determination of Finding Sentences makes it an opaque classifier.

5.2 Logistic Regression (LR)

The LR algorithm produces a binary classifier, also known as a log-linear classifier. Since the LR algorithm produces only binary classifiers, for our multi-class experiments we used the one-versus-the-rest approach. Results are shown in Tables 6 through 8.

Discussion: The results show that LR is an acceptable classifier for this problem. The two-class accuracy score of 96.3% (Table 4) is better than that of the default classifier, although in this classifier as well most of the accuracy score appears to come from the correct predictions of the Non-Finding Sentences. The two-class precision of 0.84 for Finding Sentences (Table 8) indicates that false positives are still a concern, though substantially lower than those of the NB classifier. The confusion matrix did not indicate any dominant source of error. The words and phrases (stemmed) in the highest-ranked features were similar to those used in the hand-scripted classifier.

               Precision   Recall   F1-score
  Citation     0.99        0.97     0.98
  Evidence     0.87        0.94     0.91
  Finding      0.81        0.78     0.79
  Legal-Rule   0.88        0.91     0.89
  Reasoning    0.66        0.52     0.58
  Others       0.70        0.59     0.64

Table 6. Logistic Regression Summary, Multi-Class

         C       E       F      L      R      O
  C      91.1    0.6     0.0    1.5    0.0    0.3
  E      0.7     226.6   1.3    1.1    9.8    1.1
  F      0.0     3.1     37.7   1.5    4.5    1.4
  L      0.2     1.9     1.3    85.1   3.4    1.5
  R      0.2     21.4    4.5    4.7    37.4   3.8
  O      0.2     6.2     1.8    3.3    1.7    19.1

Table 7. Logistic Regression Confusion Matrix, Multi-Class (rows are true classes, columns are predicted classes: C = Citation, E = Evidence, F = Finding, L = Legal-Rule, R = Reasoning, O = Others)

               Precision   Recall   F1-score
  Finding      0.84        0.69     0.75
  Non-Finding  0.97        0.99     0.98

Table 8. Logistic Regression Summary, Two-Class
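The top-20 features reported for the LR and SVM classifiers can be read off the fitted coefficients of any linear model. A minimal sketch for the two-class case, assuming the fitted vectorizer from Section 3.3 and a binary linear classifier clf; the Finding label string used here is an assumption.

```python
import numpy as np

def top_finding_features(clf, vectorizer, k=20, finding_label="FindingSentence"):
    """Return the k n-grams whose weights most favor the Finding class."""
    names = vectorizer.get_feature_names_out()
    weights = clf.coef_.ravel()
    # In scikit-learn, binary coefficients point toward clf.classes_[1];
    # flip the sign if the Finding class happens to be classes_[0].
    if clf.classes_[0] == finding_label:
        weights = -weights
    top = np.argsort(weights)[::-1][:k]
    return [(names[i], float(weights[i])) for i in top]
```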
5.3 Support Vector Machines (SVM)

An SVM is an ML algorithm for binary classification problems. It is based on finding a maximum-margin hyperplane that divides the training set into the two classes. Based on the success of the LR classifier, we decided to use a linear kernel for the SVM. Since SVM classifiers are by default binary, for the multi-class experiment the implementation builds one-versus-one classifiers, and a voting scheme is used to predict the label for a test instance. Some results are shown in Tables 9 and 10.

Discussion: The results show that the SVM classifier with a linear kernel performs similarly to the LR classifier. This is true for both the multi-class and the two-class experiments. However, there is substantial divergence in the top features chosen by the two algorithms. The features in common are "board find", "thus" and "whether". One hypothesis is that most of the top features are used to decide the Non-Finding class labels, and the Finding class arises as a default class. Several of the highest-ranked features seemed to be specific to PTSD cases. Also, as with the LR classifier, the confusion matrix for the multi-class SVM did not indicate any dominant source of classification error.

               Precision   Recall   F1-score
  Citation     0.98        0.96     0.98
  Evidence     0.88        0.94     0.91
  Finding      0.82        0.78     0.80
  Legal-Rule   0.90        0.90     0.90
  Reasoning    0.65        0.53     0.58
  Others       0.63        0.63     0.62

Table 9. SVM Classification Summary, Multi-Class

               Precision   Recall   F1-score
  Finding      0.85        0.74     0.79
  Non-Finding  0.98        0.99     0.98

Table 10. SVM Classification Summary, Two-Class
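A minimal sketch of the SVM configuration just described; in scikit-learn, SVC with a linear kernel trains one-versus-one classifiers internally for multi-class labels and combines their votes, so the same call covers both the multi-class and the two-class experiments. Hyperparameters are left at their defaults here, which is an assumption on our part.

```python
from sklearn.svm import SVC

def fit_linear_svm(X_train, y_train):
    """Fit the linear-kernel SVM used in Section 5.3.

    With more than two classes, SVC builds one-versus-one classifiers and
    votes over them; with the two-class Finding / Non-Finding labels it is
    a single binary SVM.
    """
    clf = SVC(kernel="linear")
    clf.fit(X_train, y_train)
    return clf
```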
6 General Discussion and Future Work

The main hypothesis for our work was that Finding Sentences in legal decisions contain such regular and distinctive phrasing that scripts written on a very small sample, as well as ML models trained on larger but still relatively small samples, could perform sufficiently well for many valuable use cases. The results of our preliminary experiments indicate that this hypothesis was correct, for the reasons we began to discuss in Section 4.3.

In the qualitative study, we used attribution theory to identify possible classification features from a very small set of 5 decisions, and we tested our hypotheses on a larger set of 45 decisions. Using finding-attribution cues and other general concepts, we developed scripts that performed reasonably well for such use cases as semantic search and semantic viewer, for the purpose of retrieving examples of reasoning in similar cases. Given the generic nature of the scripts and the small sample of labeled decisions, there is reason to think that this methodology is transferable to other areas of law. We plan to test this hypothesis in our future work.

For the ML experiments, for each of 10 runs we employed 90% of 5,797 labeled sentences for training, and the other 10% for testing. While this quantity of training/testing data was 10 times the quantity of data used to construct the hand-crafted scripts, it is still a smaller dataset than those on which ML models are typically based. The LR and SVM classifiers produced similar recall, precision and F1 scores for classifying Finding Sentences, in both the multi-class and two-class experiments. Either significantly outperformed the hand-crafted scripts in these metrics. However, we emphasize that we did not try to optimize the scripts that we tested. Our goal at this stage was to develop and test a methodology for writing such scripts, and to determine whether even basic scripts could yield promising results for some use cases. A next step is to improve the performance of our scripts in those use cases. One approach will be to employ part-of-speech tagging of at least subjects and verbs, which may improve the predictive power of script features by distinguishing between attribution subjects and cues, on the one hand, and attribution objects on the other.

A second approach will be to use our qualitative methodology to write and test scripts for the other rhetorical roles. Our results here suggest, for example, that there are promising scripts for excluding many Legal-Rule Sentences from consideration as Finding Sentences. We think that scripts can be written for positively classifying Legal-Rule Sentences. For example, in addition to any lexical features, a Legal-Rule Sentence is generally followed immediately by a Citation Sentence (or by intervening other Legal-Rule Sentences, and then a Citation Sentence). Moreover, Citation Sentences have very particular content and are highly distinguishable. Attribution theory will also guide script development for classifying Evidence Sentences. Thus, a larger qualitative study may lead to better-performing scripts.

We also intend to combine high-performing scripts into a pipeline that also includes ML or DL (deep-learning) classifiers. Scripts can add new and legally-significant labels to sentences, which can then provide input features for ML or DL classifiers. Training ML or DL classifiers on data partially annotated by scripts may improve their performance.
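A minimal sketch of the kind of hybrid pipeline contemplated here, in which the binary output of a rule-based script is appended to the n-gram features before training an ML classifier. The classify_e1n function is the illustrative script sketched in Section 4.2.1, and the overall design is an assumption about future work rather than an implemented system.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import LogisticRegression

def add_script_feature(X_ngrams, sentences, script):
    """Append one binary column per sentence: 1.0 if the script fires, else 0.0."""
    flags = np.array([[1.0 if script(s) else 0.0] for s in sentences])
    return hstack([X_ngrams, csr_matrix(flags)])

def train_hybrid(X_ngrams, sentences, labels, script):
    """Train an ML classifier on n-gram features plus the script's label."""
    X = add_script_feature(X_ngrams, sentences, script)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    return clf
```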
7 Conclusion

We used attribution theory to develop a qualitative methodology for analyzing a very small sample of labeled sentences to create rule-based scripts that can classify sentences that state findings of fact ("Finding Sentences"). We compared the results of those scripts with the performance of standard ML algorithms trained and tested on a larger dataset, but one that is still a relatively small dataset by ML standards. Both of these experiments suggest that some access-to-justice use cases can be adequately addressed with very small quantities of labeled data, and at much lower cost than previously believed.

ACKNOWLEDGMENTS

We thank the Maurice A. Deane School of Law for its support for the Research Laboratory for Law, Logic and Technology.

REFERENCES

[1] L. Al-Abdulkarim, K. Atkinson and T. Bench-Capon. 2016. Statement Types in Legal Argument. In Legal Knowledge and Information Systems (JURIX 2016), Bex, F., and Villata, S., eds. IOS Press, 3-12.
[2] M. P. Allen. 2007. Significant Developments in Veterans Law (2004-2006) and What They Reveal about the U.S. Court of Appeals for Veterans Claims and the U.S. Court of Appeals for the Federal Circuit. University of Michigan Journal of Law Reform 40, 483-568.
[3] K. D. Ashley and V. R. Walker. 2013. Toward Constructing Evidence-Based Legal Arguments Using Legal Decision Documents and Machine Learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Law (ICAIL 2013). ACM, New York, NY, 176-180.
[4] K. D. Ashley and V. R. Walker. 2013. From Information Retrieval (IR) to Argument Retrieval (AR) for Legal Cases: Report on a Baseline Study. In Legal Knowledge and Information Systems, Ashley, K. D., ed. IOS Press, 29-38.
[5] A. Bansal, Z. Bu, B. Mishra, S. Wang, K. Ashley and M. Grabmair. 2016. Document Ranking with Citation Information and Oversampling Sentence Classification in the LUIMA Framework. In Legal Knowledge and Information Systems (JURIX 2016), Bex, F., and Villata, S., eds. IOS Press, 33-42.
[6] Board of Veterans' Appeals, U.S. Department of Veterans Affairs. 2018. Annual Report, Fiscal Year 2018.
[7] H. Bunt, R. Prasad and A. Joshi. 2012. First steps towards an ISO standard for annotating discourse relations. In Proceedings of the Joint ISA-7, SRSL-3, and I2MRT LREC 2012 Workshop on Semantic Annotation and the Integration and Interoperability of Multimodal Resources and Tools (Istanbul, Turkey, May 2012), 60-69.
[8] C. Cortes and V. Vapnik. 1995. Support-Vector Networks. Machine Learning 20, 273-297. Kluwer.
[9] R-E. Fan, K-W. Chang, C-J. Hsieh, X-R. Wang and C-J. Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. J. Machine Learning Res. 9, 1871-1874.
[10] M. Grabmair, K. D. Ashley, R. Chen, P. Sureshkumar, C. Wang, E. Nyberg and V. R. Walker. 2015. Introducing LUIMA: An Experiment in Legal Conceptual Retrieval of Vaccine Injury Decisions Using a UIMA Type System and Tools. In Proceedings of the 15th International Conference on Artificial Intelligence and Law (ICAIL 2015), 69-78. ACM, New York.
[11] C. Grover, B. Hachey, I. Hughson and C. Korycinski. 2003. Automatic Summarization of Legal Documents. In Proceedings of the 9th International Conference on Artificial Intelligence and Law (ICAIL '03), 243-251. ACM, New York.
[12] T. C. Guetterman, T. Chang, M. DeJonckheere, T. Basu, E. Scruggs and V. G. Vinod Vydiswaran. 2018. Augmenting Qualitative Text Analysis with Natural Language Processing: Methodological Study. J. Med. Internet Res. 20(6), e231.
[13] B. Hachey and C. Grover. 2006. Extractive summarization of legal texts. Artificial Intelligence and Law 14, 305-345.
[14] R. Krestel, S. Bergler and R. Witte. 2008. Minding the Source: Automatic Tagging of Reported Speech in Newspaper Articles. In Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC '08) (Marrakech, Morocco, May 28-30, 2008), 2823-2828.
[15] J. Lawrence and C. Reed. 2017. Mining Argumentative Structure from Natural Language Text Using Automatically Generated Premise-Conclusion Topic Models. In Proceedings of the 4th Workshop on Argument Mining, 39-48, Copenhagen, Denmark.
[16] E. de Maat, K. Krabben and R. Winkels. 2010. Machine Learning versus Knowledge Based Classification of Legal Texts. In Proceedings of the 2010 Conference on Legal Knowledge and Information Systems (JURIX 2010), 87-96.
[17] L. T. McCarty. 2007. Deep Semantic Interpretations of Legal Texts. In Proceedings of the 11th International Conference on Artificial Intelligence and Law (ICAIL '07), 217-224. ACM, New York.
[18] R. Mochales and M-F. Moens. 2011. Argumentation mining. Artificial Intelligence and Law 19, 1-22. Springer.
[19] M-F. Moens, E. Boiy, R. Mochales and C. Reed. 2007. Automatic Detection of Arguments in Legal Texts. In Proceedings of the 11th International Conference on Artificial Intelligence and Law (ICAIL '07), 225-230. ACM, New York.
[20] V. H. Moshiashwili. 2015. The Downfall of Auer Deference: Veterans Law at the Federal Circuit in 2014. American University Law Review 64, 1007-1087.
[21] R. M. Palau and M-F. Moens. 2009. Argumentation mining: the detection, classification and structure of arguments in text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law (ICAIL 2009), 98-107, Barcelona, Spain.
[22] S. Pareti. 2011. Annotating Attribution Relations and Their Features. In Proceedings of the Fourth Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR '11) (Glasgow, Scotland, UK, October 28, 2011). ACM, New York.
[23] S. Pareti, T. O'Keefe, I. Konstas, J. R. Curran and I. Koprinska. 2013. Automatically Detecting and Attributing Indirect Quotations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (Seattle, Washington, October 18-21, 2013), 989-999.
[24] M. Patton. 1990. Qualitative Evaluation and Research Methods. Beverly Hills, CA: Sage.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel and B. Thirion. 2011. Scikit-learn: Machine Learning in Python. J. Machine Learning Res. 12, 2825-2830.
[26] S. E. Robertson and K. Sparck Jones. 1976. Relevance Weighting of Search Terms. J. American Society for Information Science 27(3), 129-146.
[27] M. Saravanan and R. Ravindran. 2010. Identification of rhetorical roles for segmentation and summarization of a legal judgment. Artificial Intelligence and Law 18, 45-76.
[28] J. Savelka, V. R. Walker, M. Grabmair and K. D. Ashley. 2017. Sentence Boundary Detection in Adjudicatory Decisions in the United States. Revue TAL 58(2), 21-45.
[29] O. Shulayeva, A. Siddharthan and A. Wyner. 2017. Recognizing cited facts and principles in legal judgements. Artificial Intelligence and Law 25(1), 107-126.
[30] C. Stab and I. Gurevych. 2014. Identifying Argumentative Discourse Structures in Persuasive Essays. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 46-56, Doha, Qatar.
[31] S. Teufel and M. Moens. 2002. Summarizing scientific articles: experiments with relevance and rhetorical status. Computational Linguistics 28(4), 409-445.
[32] H. Wachsmuth, M. Potthast, K. Al-Khatib, Y. Ajjour, J. Puschmann, J. Qu, J. Dorsch, V. Morari, J. Bevendorff and B. Stein. 2017. Building an Argument Search Engine for the Web. In Proceedings of the 4th Workshop on Argument Mining, 49-59, Copenhagen, Denmark.
[33] V. R. Walker. 2014. Representing the use of rule-based presumptions in legal decision documents. Law, Probability and Risk 13(3-4), 259-275. Oxford UP.
[34] V. R. Walker, P. Bagheri and A. J. Lauria. 2015. Argumentation Mining from Judicial Decisions: The Attribution Problem and the Need for Legal Discourse Models. Paper at the First Workshop on Automated Detection, Extraction and Analysis of Semantic Information in Legal Texts (ASAIL 2015), San Diego, California, USA. URL: https://people.hofstra.edu/vern_r_walker/WalkerEtAl-AttributionAndLegalDiscourseModels-ASAIL2015.pdf.
[35] V. R. Walker, N. Carie, C. C. DeWitt and E. Lesh. 2011. A framework for the extraction and modeling of fact-finding reasoning from legal decisions: lessons from the Vaccine/Injury Project Corpus. Artificial Intelligence and Law 19, 291-331.
[36] V. R. Walker, D. Foerster, J. M. Ponce and M. Rosen. 2018. Evidence Types, Credibility Factors, and Patterns or Soft Rules for Weighing Conflicting Evidence: Argument Mining in the Context of Legal Rules Governing Evidence Assessment. In Proceedings of the 5th Workshop on Argument Mining (ArgMining 2018), 68-78. ACL.
[37] V. R. Walker, J. H. Han, X. Ni and K. Yoseda. 2017. Semantic Types for Computational Legal Reasoning: Propositional Connectives and Sentence Roles in the Veterans' Claims Dataset. In Proceedings of the 16th International Conference on Artificial Intelligence and Law (ICAIL '17), 217-226. ACM, New York.
[38] V. R. Walker, A. Hemendinger, N. Okpara and T. Ahmed. 2017. Semantic Types for Decomposing Evidence Assessment in Decisions on Veterans' Disability Claims for PTSD. In Proceedings of the Second Workshop on Automatic Semantic Analysis of Information in Legal Texts (ASAIL 2017), 10 pages, London, UK.
[39] B. Waltl, G. Bonczek, E. Scepankova and F. Matthes. 2019. Semantic types of legal norms in German laws: classification and analysis using local linear explanations. Artificial Intelligence and Law 27, 43-71. Springer.
[40] D. Walton. 2009. Argumentation theory: A very short introduction. In Guillermo Simari and Iyad Rahwan, editors, Argumentation in Artificial Intelligence, 1-22. Springer, US.
[41] B. Webber and A. Joshi. 2012. Discourse Structure and Computation: Past, Present and Future. In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries (Jeju, Republic of Korea, July 10, 2012), 42-54.