Optimizing Language Models for Argumentative Reasoning

Luke Thorburn¹,²,∗, Ariel Kruger¹

¹ Hunt Lab, University of Melbourne, Parkville, Victoria 3010, Australia
² King's College London, Strand, London WC2R 2LS

Abstract
Large transformer-based causal language models are capable of strong performance on many natural language processing tasks. Here, we systematically evaluate the performance of the 2.7 billion parameter GPT Neo pre-trained language model on 6 argumentative reasoning tasks under 5 different optimization strategies, including prompt programming, soft prompts, and parameter tuning. We report both intrinsic evaluation metrics (perplexity), and extrinsic measures of the coherence of model outputs, as judged by an expert human rater. With a few exceptions, the rate at which models produced coherent responses ranged from 15-50%. In contrast, human performance (users of the Kialo argument mapping platform) ranged from 65-82% coherent, depending on the task. These results suggest that larger, suitably optimized language models may be capable of supporting authors and auditors of natural language argument maps in human-in-the-loop settings. We share our finetuned models and code.

Keywords
language model, argument generation, finetuning, soft prompt

1st International Workshop on Argumentation and Machine Learning @ COMMA 2022, September 13, 2022, Cardiff, Wales
∗ Corresponding author.
luke.thorburn@kcl.ac.uk (L. Thorburn); ariel.kruger@unimelb.edu.au (A. Kruger)
https://lukethorburn.com/ (L. Thorburn)
ORCID: 0000-0003-4120-5056 (L. Thorburn); 0000-0002-0121-2780 (A. Kruger)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Since Douglas Engelbart first envisaged software for authoring structured argumentation [1], the goal of an algorithm that can automatically check human reasoning—a “spell checker for logic”—has been discussed. More broadly, knowledge workers of many types (including academics, risk analysts, and intelligence analysts) stand to benefit from tools that help them reliably and efficiently construct coherent arguments. One approach to realizing such tools is to integrate automated reasoning algorithms with argument mapping software, an approach that we have taken recently¹. There are many specific argumentative tasks that it may be useful to automate, such as the generation of reasons and objections, the identification of unstated premises, and the process of “tightening up” an argument by rewording premises to best represent the most argumentatively appropriate claim. In resource-constrained settings, it may only be possible to automate these numerous tasks if there is a common, accessible method that can be applied to all of them. In this paper, we investigate the extent to which an open-source pre-trained language model, the 2.7 billion parameter version of GPT Neo [2], can be optimized to perform such argumentative reasoning.

¹ Our prototype “argument processor”, which integrates with the commercial OpenAI language model API, can be found at https://luke-thorburn.github.io/argument-processor/.

1.1. Background

Large transformer-based causal language models are capable of strong performance on many natural language processing (NLP) tasks [3, 4].
This flexibility is made possible by the generality of causal language modeling—the task of predicting what text comes next, conditioned on the text that has come before. Any task that can be articulated as a natural language prompt followed by a response can be posed to a causal language model. For this reason, pre-trained language models can in some cases serve as few-shot or zero-shot learners [5, 6, 7, 8] by including instructions or examples of the task as part of the prompt [8]. Performance can often be improved further by tuning some or all of the model weights [9, 10]. Previous academic work has investigated whether language models can emulate different types of logical reasoning. For example, El Baff et al. [11] use language models to synthesize arguments for or against a given claim by training a language model over a vocabulary of curated argumentative discourse units. Clark et al. [12] find that when facts and rules are presented in natural language, a language model can reason over a knowledge base with high accuracy. Gurcke et al. [13] evaluate whether premises are sufficient to draw a conclusion by comparing the stated conclusion with one generated by a language model. Skitalinskaya et al. [14] tune a language model to evaluate the quality of argumentative claims. Other work has investigated the ability of language models to identify logical fallacies [15]. Increasingly, commercial prototypes are also demonstrating the potential for language models to assist with human reasoning tasks in a human-in-the-loop manner. A prominent example is Elicit [16], an “AI research assistant” powered by OpenAI’s GPT-3 language model that includes tools to brainstorm counterarguments, increase the specificity of claims, and suggest antecedents and consequences to expand a partial chain of reasoning. Without providing a comprehensive review, we note that there are other approaches to automating natural language reasoning that do not rely solely on language models. One prominent example is Project Debator [17, 18], an attempt to build an autonomous agent that can compete in formal debate. 1.2. Contribution In this project, we systematically explored the performance of the GPT Neo pre-trained language model on 6 argumentative reasoning tasks, under 5 different optimization strategies. The tasks, described in Section 2.2, correspond to tasks commonly performed by an analyst in the course of authoring an argument map. We report both intrinsic evaluation metrics (perplexity), and extrinsic measures of the coherence of model outputs, as judged by a human rater. To our knowledge, this is the first systematic evaluation of a large (>109 parameter) language model on argumentative reasoning tasks, despite the success of such large models elsewhere in NLP [5, 19]. Our results form a baseline for future work, and provide insight into which optimization strategies are most successful. In addition, we are releasing the finetuned models, along with code for performing optimization and inference, to aid future research2 . 2. Inputs In this section we describe the pre-trained foundation model we used, the NLP tasks for which we optimized it, and the datasets used for tuning. 2.1. Foundation model As our foundation model we used the 2.7 billion parameter version of the open-source GPT Neo language model [2]. GPT Neo has a transformer-based decoder architecture designed to replicate that of GPT-3 [5], albeit with fewer and smaller layers than are implemented in the largest version of GPT-3. 
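As a minimal illustration (not part of our experimental pipeline), the publicly released checkpoint can be loaded and prompted directly with the Hugging Face transformers library; the prompt wording below is illustrative rather than one of the exact templates described in Section 2.2.

```python
# Minimal sketch (not our experimental code): load the public GPT Neo 2.7B
# checkpoint from the Hugging Face Hub and sample a completion for a
# zero-shot argumentative prompt. The prompt wording is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-2.7B"   # roughly 10 GB of RAM/VRAM in float32
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
model.eval()

prompt = (
    "List reasons why:\n"
    "Nuclear power should be part of the energy mix.\n"
    "Reasons:\n"
    "* "
)

inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.9,
        max_new_tokens=50,
        pad_token_id=tokenizer.eos_token_id,
    )

# Print only the newly generated continuation (a candidate reason).
new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```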
For details on the architecture of GPT Neo, we direct the reader to the original papers on the GPT series of models [20, 21, 5]. GPT Neo was pre-trained on ‘The Pile’, an 800GB corpus of diverse text collated for language modeling [22]. 2.2. Tasks We investigated six argumentative NLP tasks, which are described in Table 1. Tasks 1-5 could be described as a type of “masked argument modeling” or cloze completion. They take as inputs a small argument map—potentially a subset of a much larger map—with one or more claims missing (strictly one in all cases except s u g g e s t - i n t e r m e d i a r y - c l a i m s , where arbitrarily many may be missing). The task is then to generate a claim or claims that could coherently fill the gap. The five tasks differ in the type and position of the claim that has been masked (whether it is a reason, co-premise, conclusion, etc.). All tasks investigated are generative, and intended to aid an analyst as they construct an argument map by (a) improving the efficiency with which they can compose relevant claims, and (b) prompting them to consider counter-arguments or implicit assumptions that they may not otherwise have identified. The optimized models are intended to be integrated with argument mapping tools where a hi-tree data structure [23] can be assumed. This avoids the need to rely on an imperfect argument mining pipeline to extract such a structure from argumentation presented in prose. Task 6, s u g g e s t - a b s t r a c t i o n , corresponds to a common step in the process of refining an ill-formed argument map. Often a premise will be too specific in relation to the target claim to best characterize the nature of the logical relationship between them. In such circumstances, the analyst should revise the claim to describe a more general or abstract inferential principle, which is the task we try to automate. Consider the following example, taken from [24]. Nuclear power is generated by the breakdown of heavy elements. ⟹ Nuclear power has very low greenhouse gas emissions. 2 The code, along with details of how to download the models, can be found at https://github.com/Hunt-Laboratory/ language-model-optimization. Table 1 The 6 argumentative NLP tasks we investigated. All tasks are posed in the context of an argument map, assumed to be a set of claims with a hi-tree structure [23]. A reason is a claim that supports another claim, an objection is a claim that attacks another claim, and co-premises are claims that jointly, but not separately, support another claim. A conclusion or target claim is a claim for which there exist reasons, objections or co-premises. ID Task Description 1 suggest-reasons Given a target claim and (optionally) some reasons, suggest (additional) reasons. 2 suggest-objections Given a claim and (optionally) some objections, suggest (additional) objections. 3 suggest-conclusion Given one or more reasons, suggest a claim that can be inferred from the reasons. 4 suggest-intermediary-claims Given a starting claim (a reason) and an end claim, suggest an expanded sequence of claims containing intermediary inferences between the start claim and the end claim. 5 suggest-copremise Given a claim and one or more co-premises, suggest additional co-premises required for the inference to be valid. 6 suggest-abstraction Given a claim and a reason, suggest a more abstract version of the reason that better represents the logical relationship between the reason and the claim. 
This argument is loosely valid, but the premise is overly narrow and does not adequately represent the core reason it supports the conclusion. A better version would be:

The physics of nuclear power generation involves no combustion.
⟹ Nuclear power has very low greenhouse gas emissions.

It is this specific type of revision that the suggest-abstraction task is intended to perform, under the assumption that such an abstraction is required.

To formulate all of these tasks so they can be presented to a language model, we format them as a text prompt, the completion of which constitutes a response to the task. For example, the suggest-reasons task might be formulated as follows.

List reasons why:
<CLAIM>
Reasons:
* <REASON 1>
* <REASON 2>
* <REASON 3>
*

The model must then generate <REASON 4> to complete the prompt³.

³ The prompt templates used can be found at https://github.com/Hunt-Laboratory/language-model-optimization.

2.3. Data

Training, validation, and test datasets for each task, where possible, were generated from a collection of argument maps scraped from the online collaborative argument mapping platform Kialo⁴. The scrape was performed in January 2020 by Lenz et al. [25], and the data was provided to us on request. The dataset contains 180,736 claims across 563 argument maps, centered on contentions such as “Traditional bullfighting ... should be banned”, “Climate change can be reversed”, and “Darwinian evolution is philosophy not science.” The maps have the structure of a simple argumentation framework: a tree where vertices are claims and edges denote pro or con relations. The maps underwent several preprocessing steps, the most noteworthy of which are summarized below.

• Maps were randomly assigned to training (60%), validation (20%), and test (20%) sets.
• Claims beyond a certain depth from the root claim in each map were filtered out, because qualitative exploration suggested that the quality and coherence of claims decreased the further they were from the root claim. These claims have likely received less scrutiny on Kialo because they are less salient in the user interface. The depth of truncation differs slightly for each task, and is specified in Table 2.
• All forks (a claim with its children) and branches (a sequence of supporting claims from a leaf to the root of a tree) were extracted for each map.
• Depending on the task, the forks and branches were randomly or deterministically subsetted further to generate a greater number of training, validation, and test examples. For example, if a fork contained multiple reasons, any subset of those reasons could be included in a prompt for the suggest-reasons task, and any one of those reasons could be held out to serve as the “correct” completion for the purposes of supervised training and evaluation. This subsetting process leads to a combinatorial explosion in the number of candidate examples, so random selection was used to limit the size of the dataset where resource constraints required it.

At most, the training sets were limited to 50,000 examples, and the validation and test sets to 10,000 examples. The final number of examples in each of the dataset splits for each task is provided in Table 2. Note that for both the suggest-copremise and suggest-abstraction tasks, there is no training or validation data available, because Kialo does not support co-premises or tag revisions of claims that could be considered abstractions.
For the same reasons, the test set for these tasks is unlabelled, so performance on this task can only be evaluated by human raters. 3. Optimization In this section we describe the strategies, software and hardware used for optimizing the foundation model. 4 See https://www.kialo.com/. Table 2 The sizes of the datasets for each task we investigated. The asterisk on the test set for the final two tasks indicates that it was unlabelled, so could only be used when manually evaluating model outputs. Dataset size Task Truncation Depth Training Validation Test suggest-reasons 4 50,000 10,000 10,000 suggest-objections 4 50,000 10,000 10,000 suggest-conclusion 6 15,250 4,503 4,823 suggest-intermediary-claims 4 29,894 6,514 8,766 suggest-copremise 4 0 0 1,043* suggest-abstraction 4 0 0 8,766* 3.1. Strategies There is a rapidly growing literature on strategies for optimizing pre-trained language models to perform specific tasks. Prompt programming refers to one family of methods, in which the pre-trained model weights are taken as fixed but the structure of the text prompt is tweaked to improve the quality of the output [26]. Exploration of different prompt formats can be systematic, but there is an art to crafting better prompts, guided by heuristics. So-called zero-shot prompts only contain task-specific instructions and the details of the specific instance of the task being performed. In contrast, few-shot prompts contain additional complete examples of the task being performed. In one context, the performance gains afforded by the inclusion of an additional example within a few-shot prompt have been observed to be roughly equivalent to tuning the full model on 100 examples [27]. In other contexts, zero-shot prompts significantly outperform few-shot prompts [8]. When using few-shot prompts, systematic experimentation can help determine which examples to include in the prompt [28], and in what order to list them [29]. Another prompt programming strategy is to formulate tasks in a common template, such as question answering [30] or textual entailment [10]. In a related approach, a short sequence of additional tokens with randomly initialized embed- dings (known as a soft prompt) is prepended to a minimal input prompt containing the details of the specific instance of the task to be performed. The embeddings for these additional tokens are tuned in a supervised fashion while all other model parameters remain fixed [31, 32, 33]. In this way, the “wording” of the prompt can be continuously optimized using conventional gradient-based optimization methods. There are a number of proposed strategies for selectively tuning a subset of the parameters in the main body of the model, short of tuning the full model. For example, tuning only the bias parameters can be both computationally efficient and effective [34, 9]. Alternately, meta-tuning describes the approach in which the foundation model is first tuned for a general task such as instruction following or question answering, before being applied to specific tasks of interest [30, 6]. Given our focus on exploring the applicability of pre-trained language models across multiple argumentative reasoning tasks—particularly in data-limited settings—we selected 5 optimization strategies that we evaluated (data and funding permitting) for each of the 6 tasks. 
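To make the soft prompt approach concrete before listing the strategies, the following is a minimal sketch of the mechanism, assuming a Hugging Face causal language model. It is a simplified illustration rather than our training code, which was adapted from Parker [37] and run under DeepSpeed.

```python
# Minimal sketch of soft prompt tuning (simplified illustration, not our
# training code): learnable "virtual token" embeddings are prepended to the
# embedded prompt, and only those embeddings receive gradient updates while
# every pre-trained parameter stays frozen.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

for param in model.parameters():
    param.requires_grad = False          # freeze the pre-trained weights
# (For bias-only tuning, one would instead re-enable gradients for every
# parameter whose name ends in "bias".)

n_soft_tokens = 20
embed_dim = model.get_input_embeddings().embedding_dim
soft_prompt = nn.Parameter(0.02 * torch.randn(n_soft_tokens, embed_dim))
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)

def step(prompt_text: str, response_text: str) -> float:
    """One gradient step on a single (prompt, response) example."""
    ids = tokenizer(prompt_text + response_text, return_tensors="pt")["input_ids"]
    token_embeds = model.get_input_embeddings()(ids)                  # (1, T, d)
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), token_embeds], dim=1)
    # Soft prompt positions are excluded from the loss via the -100 label;
    # the remaining labels cover both prompt and response tokens.
    labels = torch.cat(
        [torch.full((1, n_soft_tokens), -100, dtype=torch.long), ids], dim=1
    )
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                      # updates only the soft prompt
    return loss.item()

step("List reasons why:\nSchool uniforms should be optional.\nReasons:\n* ",
     "Uniforms are an unnecessary expense for families.")
```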
Informed by the above literature, the strategies evaluated were: • zero-shot prompt, no tuning • few-shot prompt, no tuning • soft prompt + zero-shot prompt, only the soft prompt tuned • zero-shot prompt, only bias parameters tuned • zero-shot prompt, all parameters tuned 3.2. Software Model tuning, evaluation, and text generation were performed in Python using PyTorch [35] via the Hugging Face transformers library [36], with some custom class extensions to account for bespoke data loading, logging of evaluation metrics, and the insertion of soft prompts. We used the WarmupLR learning rate scheduler with the AdamW optimizer, a batch size of 32, and continued tuning until the validation loss (evaluated relatively infrequently due to computational cost) was observed to increase. The code for creating, tuning, and generating text in the presence of soft prompts was adapted from Parker [37]. Training large neural networks while avoiding out-of-memory errors can be challenging. To manage parallel and efficient training and evaluation of GPT-Neo on limited hardware we used the Zero Redundancy Optimizer (ZeRO, stage 2), as implemented in the DeepSpeed library [38]. Within this framework, optimizer states and model weights are partitioned across parallel processes such that each process updates only its partition, and retains only the gradients corresponding to its portion of the optimizer states, whilst also offloading optimizer memory and computation to the CPU. This avoided out-of-memory errors and allowed training to be performed. 3.3. Hardware Training and evaluation were performed remotely on a commercial virtual machine with four NVIDIA Quadro RTX 6000 GPUs, twenty-four AMD EPYC 7502P (2.50 GHz) virtual CPUs, and 2.78TB of storage. In total, the virtual machine had 96GB of virtual RAM (24GB per GPU), and 184 GB of conventional RAM. Total cloud compute costs were about USD 1250, and the tuning for all tasks and strategies took place over 228 hours. 4. Evaluation In this section we describe the methods used to evaluate each optimization strategy on each task, along with our results. 4.1. Methods Where possible, we evaluated each application of an optimization strategy to a task using both automated (intrinsic) and manual (extrinsic) methods. The inclusion of manual evaluation was to provide more interpretable insight into the quality of the model outputs, and to allow us to evaluate model outputs on unsupervised tasks where only prompts—not responses—were available (namely, s u g g e s t - c o p r e m i s e and s u g g e s t - a b s t r a c t i o n ). 4.1.1. Automated For each task that included responses in the test set, we calculated the perplexity of each approach across all examples in the test set. Perplexity is a standard evaluation measure for language models in the supervised setting and is equal to the exponential of the average cross- entropy across tokens in the evaluation text, relative to the model. The lower the perplexity of a sequence of tokens, the greater the likelihood the model assigns to that sequence. Perplexity is not a measure of reasoning quality specifically, but in this context captures how likely the model was to generate the “correct”, human-written argumentative claims that form the gold standard responses to the prompts. 
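For concreteness, the following is a minimal sketch of how perplexity can be computed for a single prompt and response pair, in the two ways described in the next paragraph. It assumes a Hugging Face causal language model and is illustrative rather than our exact evaluation code.

```python
# Sketch: perplexity of one prompt-response pair, computed (a) over all tokens
# and (b) over the response tokens only, with the prompt still fed through the
# model so the response is conditioned on it. (Illustrative only; it also
# assumes the prompt tokenizes identically when it is a prefix of the full string.)
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexities(prompt: str, response: str) -> tuple[float, float]:
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt")["input_ids"]

    with torch.no_grad():
        # (a) full text: mean cross-entropy over prompt and response tokens.
        full_ce = model(full_ids, labels=full_ids).loss

        # (b) response only: prompt positions are masked out of the loss
        # (label -100) but remain in the input as conditioning context.
        labels = full_ids.clone()
        labels[:, :prompt_len] = -100
        response_ce = model(full_ids, labels=labels).loss

    return math.exp(full_ce.item()), math.exp(response_ce.item())

ppl_full, ppl_response = perplexities(
    "List reasons why:\nRecycling should be mandatory.\nReasons:\n* ",
    "It reduces the volume of waste sent to landfill.",
)
print(f"full text: {ppl_full:.2f}   response only: {ppl_response:.2f}")
```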
We calculated perplexity in two ways: (a) using the mean cross-entropy across all tokens (in both the prompts and responses) and (b) using the mean cross-entropy across only the response tokens (though the prompts were still fed through the model to allow it to condition on them). Perplexity across all the tokens is substantially lower because they contain repetitive boilerplate (which could be memorized during tuning) and, further, this measure corresponds to the loss function on which the models were tuned, where such tuning occurred. That said, the second perplexity measure (considering only the response tokens) is perhaps more meaningful, given that we ultimately care about generating unknown outputs and not regenerating the input prompts.

4.1.2. Manual

Manual evaluation was performed using a bespoke method and rubric. We randomly sampled 100 examples from the test set for each task. Then, under all optimization strategies for each task, we generated sample responses of length 150 tokens for each of these 100 examples, conditioned on the appropriately formatted prompt. The temperature used for generation was 0.9. These generated outputs were cleaned according to the following rules (sketched as code below).

• Rounded parentheses and their contents were removed, along with asterisks, underscores, backticks, any leading numbered list indices (e.g. “1. ”), trailing whitespace, and all characters before the first letter, number, or quotation mark.
• The text was split into lines, and lines into sentences. Only the first line was retained and, unless the task was suggest-intermediary-claims, only the first sentence on the first line. The first letter was capitalized.
• If the task was suggest-intermediary-claims, the string was split on the claim delimiter (either “~” or “=>”), and the first and last claims were removed on the assumption that they had been correctly reproduced from the input prompt.

Table 3
The rating rubric used by raters to evaluate the coherence of suggestions generated by the model.
1 Incoherent −  Suggestion (as written) is not relevant or coherent, and there is no insight to be gained from it.
2 Incoherent +  Suggestion (as written) is not relevant or coherent, but the suggestion prompts the user to think of adjacent ideas or suggestions that are relevant and coherent.
3 Coherent −  Suggestion (as written) is relevant and coherent, but some editing is required to be usable.
4 Coherent +  Suggestion (as written) is relevant and coherent, and would be usable as written.

The generated outputs were pooled with the human-generated claims from Kialo (to provide a human benchmark) and sorted according to randomly assigned IDs for each test example, such that tasks and strategies appeared in a random order, but outputs for each example task appeared consecutively. The example tasks and responses were presented to raters in this order (to reduce cognitive switching costs), and formatted as small argument maps using a custom interface⁵, in which the claims to be rated were highlighted. Raters were blind to the source of the highlighted claims. Each generated output was rated for coherence, using the rubric in Table 3. In this context, a claim is understood to be coherent (either “Coherent−” or “Coherent+”) if it is (a) able to be understood, and (b) logically consistent with neighboring claims, in the manner implied by its position in the argument map. Note, claims can be coherent without being true⁶.
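The cleaning rules listed above can be sketched as follows (a simplified, approximate re-implementation; the exact post-processing script we used is in the project repository).

```python
# Simplified sketch of the output-cleaning rules listed above (approximate;
# the exact post-processing script is in the project repository).
import re

def clean_output(text: str, task: str) -> str:
    text = re.sub(r"\([^)]*\)", "", text)          # rounded parentheses and their contents
    text = re.sub(r"[*_`]", "", text)              # asterisks, underscores, backticks
    text = re.sub(r"^\s*\d+\s*\.\s*", "", text)    # a leading numbered list index, e.g. "1. "
    text = re.sub(r"""^[^A-Za-z0-9"']+""", "", text)  # everything before the first letter, digit or quote
    text = text.rstrip()                           # trailing whitespace

    # Keep the first line only and, except for suggest-intermediary-claims,
    # the first sentence on that line; then capitalise the first letter.
    first_line = text.splitlines()[0] if text else ""
    if task != "suggest-intermediary-claims":
        first_line = re.split(r"(?<=[.!?])\s+", first_line)[0]
    return first_line[:1].upper() + first_line[1:]

print(clean_output(
    "1. *Uniforms* are expensive for many families. They also limit self-expression.\n"
    "A second suggested objection.",
    task="suggest-reasons",
))
# -> "Uniforms are expensive for many families."
```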
We chose to evaluate coherence because it arguably represents the minimum requirement for generated model outputs to be useful to a human analyst, and is a dimension of quality against which all six tasks can be evaluated. The rubric and the rating interface were developed over two pilot rating rounds. Each model output was rated once by one of two raters (the authors of this study), both of whom are familiar with argument mapping conventions. 4.2. Results In all training runs, learning was successful with an initial rapid decrease in loss, followed by a plateauing of the loss functions on both the training and validation sets. Severe overfitting (e.g. a U-shaped validation loss curve) was not seen within the training durations observed. As more parameters were made available for tuning, fewer examples were needed to reach the point of (mild) overfitting. For example, when tuning all parameters in the model, the model started to overfit less than 10% of the way through the first epoch. In contrast, the soft prompt and bias-only tuning strategies took at least one full epoch to reach the point of overfitting. 5 A variant of the interface at https://luke-thorburn.github.io/argument-processor/. 6 We originally included truth as a second dimension in the rating rubric. However, in practice the truth status of most model suggestions was either not able to be assessed (because they were normative or nonsensical), or was not verifiable in the time available. 4.2.1. Automated The results of the automated evaluation are shown in Table 4. Across all strategy/task combina- tions studied, full text perplexity ranges between about 5 and 25. For reference, GPT-Neo-2.7B achieves a perplexity of 5.646 on its test set from The Pile [2]. The zero-shot, no tuning strategy consistently performed the most poorly, whilst the soft prompt strategy produced the lowest perplexities observed across all tasks. When considering perplexity calculated only using the response tokens, the picture changes. In general, perplexity values are much higher here due to the exclusion of repetitive prompt boilerplate. The soft prompt strategy still performs competitively, but it is the few-shot strategy with no tuning that produced the lowest perplexity across all tasks. In contrast, the strategies with greater numbers of parameters tuned often performed more poorly, especially so in the case of the bias-only tuning strategy. This may be because they overfit to the prompt boilerplate at an earlier checkpoint, but this was not noticed in the training metrics because they only included the full text (both prompt and response). 4.2.2. Manual The results of the manual evaluation are shown in Table 5. In general, the picture emerging from the manual evaluations is less clear than that of the automated evaluations, with different optimization strategies performing best on different tasks. This may in part be due to the relatively small sample of 100 examples rated, as well as noise due to the imperfect reliability of the rating scale. With a few exceptions, the rate at which models produced coherent responses ranged roughly between 15% and 50%, which may be acceptable in a human-in-the-loop setting where multiple suggestions can be generated concurrently and those that are incoherent quickly discarded. Notably, the relatively more subtle tasks s u g g e s t - a b s t r a c t i o n and s u g g e s t - c o p r e m i s e achieved coherence rates of 18% and 25% using merely a few-shot prompt and no parameter tuning. 
We also rated the original human outputs from Kialo, where available, to provide a benchmark. Whilst better than all the language model outputs, the human benchmark is relatively low, not rising above 82% coherence. This reinforces the difficulty of the tasks (at least when posed to crowdsourced teams) and also raises questions about the quality of Kialo as a source of training data.

Figure 1 includes two examples of generated model outputs for the suggest-objections task. The complete set of generated examples along with their ratings can be found in the project GitHub repository⁷.

⁷ Available at https://github.com/Hunt-Laboratory/language-model-optimization.

Figure 1: Two randomly selected examples of model-generated outputs for the suggest-objections task. We present the generated model output (in bold) and the context on which the model was conditioned in a form abstracted from the particular prompt templates used. Example (a) was generated by GPT Neo with an optimized soft prompt, example (b) was generated by unfinetuned GPT Neo with a zero-shot prompt.

(a) Example of model output that was rated “Coherent +”.
Claim: Cultural appropriation is wrong.
Objections:
• Integrating different cultures is one of the main way for cultures to develop themselves.
• People who learn other cultures will become more tolerant, open minded and open to new experiences.

(b) Example of model output that was rated “Incoherent −”.
Claim: Private schools preserve traditions that are absent, or otherwise impractical to maintain, in the state system.
Objections:
• Given this has never been tried, and no examples are given, there is no reason to assume these traditions cannot be moved to a public system.
• Not all of these traditions are good, and many can perpetuate socio-economic divides far beyond the school system, for example by creating ‘old boys clubs’.
• This is a ‘we must preserve these traditions for our daughters’, rather than a ‘it is to our children’ argument.

From one perspective, coherence is a low bar. A suggestion can be coherent without being new, true, important, or eloquent. On the other hand, coherence is a significant milestone, revealing an ability to abide by conventional rules of logical argumentation. The observed coherence rates were achieved in a model that is at least two orders of magnitude smaller than commercial models that are state-of-the-art [39], with limited exploration of the space of optimization regimes. It is foreseeable that with larger language models and more dedicated effort, an automated approach to argumentative reasoning tasks could reach coherence on par with that of Kialo users.

5. Conclusion

Recently, large language models have come to dominate the field of natural language processing, but arguably remain underexplored in the computational argumentation literature. In this paper, we systematically evaluated the performance of a 2.7 billion parameter pretrained language model across 6 argumentative reasoning tasks, using 5 different optimization strategies. With a few exceptions, the rate at which the models produced coherent responses ranged from 15-50%, compared to human performance of 65-82%. We share our finetuned models and code.
To our knowledge the language model studied is larger than those previously considered in the argumentation literature, but it has at least two orders of magnitude fewer parameters than those that are state of the art on other NLP tasks, and the labeled data used for finetuning was of dubious quality. Natural next steps would be to evaluate the performance of much larger pretrained language models on the same argumentative reasoning tasks, and to invest in the development of larger, high-quality labeled datasets of natural language argumentation to use for finetuning. That said, language models fundamentally model statistical—rather than logical—relationships between words, and it is not clear whether bigger models and better data alone will be sufficient to produce reliably coherent results. Thus, it would be valuable to explore how language models could be combined with symbolic argumentation methods to improve the coherence of generated arguments.

Table 4
Perplexity scores on the test set, calculated separately for the full text (prompt and response) and the response only. Lower perplexity scores are preferred, indicating that the test text was assigned a higher likelihood by the language model. Note that suggest-copremise and suggest-abstraction were not evaluated because there was no human data to reference against, and suggest-intermediary-claims was not evaluated for the zero-shot optimization method because the output required particular formatting conventions which could not be plausibly conveyed in a zero-shot prompt (i.e. without concrete examples). Columns correspond to GPT-Neo with a zero-shot prompt, a few-shot prompt, a tuned soft prompt, bias parameters tuned, and all parameters tuned.

Full text
Task                        | Zero-shot | Few-shot | Soft prompt | Bias tuned | All tuned
suggest-reasons             | 20.62     | 10.03    | 9.99        | 14.01      | 14.59
suggest-objections          | 20.10     | 10.39    | 9.88        | 12.91      | 13.42
suggest-conclusion          | 23.31     | 10.51    | 8.67        | 13.50      | 13.89
suggest-intermediary-claims | –         | 8.15     | 4.69        | 5.89       | 5.96
suggest-copremise           | –         | –        | –           | –          | –
suggest-abstraction         | –         | –        | –           | –          | –
Average                     | 21.35     | 9.77     | 8.31        | 11.58      | 11.96

Response only
Task                        | Zero-shot | Few-shot | Soft prompt | Bias tuned | All tuned
suggest-reasons             | 203.37    | 146.91   | 378.24      | 754.77     | 298.97
suggest-objections          | 363.26    | 68.28    | 260.71      | 1375.35    | 757.49
suggest-conclusion          | 355.20    | 223.70   | 223.81      | 936.33     | 867.84
suggest-intermediary-claims | –         | 82.43    | 63.65       | 5260.00    | 635.38
suggest-copremise           | –         | –        | –           | –          | –
suggest-abstraction         | –         | –        | –           | –          | –
Average                     | 307.28    | 130.33   | 231.60      | 2081.61    | 639.92

Table 5
Manual coherence ratings for samples of 100 examples from each test set (rubric in Table 3). The same explanation of the empty cells that is given in the caption of Table 4 holds, with the additional note that we did not manually evaluate the “few-shot, no tuning” strategy for the suggest-reasons, suggest-objections, and suggest-conclusion tasks. This was a strategic decision due to limited funds available for rating, and based on prior research indicating that well-formed zero-shot prompts had outperformed few-shot prompts in similar contexts. Given that the perplexity scores were lower for few-shot prompts than for zero-shot prompts across the response tokens, it would be interesting to perform this manual evaluation in future. Columns correspond to GPT-Neo with a zero-shot prompt, a few-shot prompt, a tuned soft prompt, bias parameters tuned, all parameters tuned, and the human (Kialo) benchmark.
Coherence (%)
Task                        | Zero-shot | Few-shot | Soft prompt | Bias tuned | All tuned | Human
suggest-reasons             | 47.0      | –        | 41.0        | 53.0       | 39.0      | 80.0
suggest-objections          | 37.0      | –        | 33.0        | 46.0       | 33.0      | 82.0
suggest-conclusion          | 18.0      | –        | 21.0        | 26.0       | 32.0      | 65.0
suggest-intermediary-claims | –         | 16.0     | 3.0         | 15.0       | 24.0      | 75.0
suggest-copremise           | –         | 25.0     | –           | –          | –         | –
suggest-abstraction         | –         | 18.0     | –           | –          | –         | –
Average                     | 34.0      | 19.7     | 24.5        | 35.0       | 32.0      | 75.5

Coherence (mean rating)
Task                        | Zero-shot | Few-shot | Soft prompt | Bias tuned | All tuned | Human
suggest-reasons             | 2.26      | –        | 2.17        | 2.46       | 2.13      | 3.36
suggest-objections          | 2.03      | –        | 1.88        | 2.24       | 1.93      | 3.31
suggest-conclusion          | 1.51      | –        | 1.75        | 1.75       | 1.90      | 2.82
suggest-intermediary-claims | –         | 1.54     | 1.15        | 1.56       | 1.94      | 3.17
suggest-copremise           | –         | 1.71     | –           | –          | –         | –
suggest-abstraction         | –         | 1.54     | –           | –          | –         | –
Average                     | 1.93      | 1.60     | 1.74        | 2.00       | 1.98      | 3.17

Acknowledgments

This research was funded by the Australian Department of Defence and the Office of National Intelligence under the AI for Decision Making Program, delivered in partnership with the Defence Science Institute in Victoria. The authors would like to thank other members of the Hunt Lab, particularly Tim van Gelder and Ashley Barnett for helpful discussions, and anonymous reviewers for their constructive feedback.

References

[1] D. C. Engelbart, Augmenting Human Intellect: A Conceptual Framework, Summary Report, Project #3578, Stanford Research Institute, 1962. URL: https://apps.dtic.mil/sti/pdfs/AD0289565.pdf.
[2] S. Black, L. Gao, P. Wang, C. Leahy, S. Biderman, GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, Zenodo, 2021. doi:10.5281/zenodo.5297715.
[3] G. Branwen, GPT-3 Creative Fiction, 2020. URL: https://www.gwern.net/GPT-3.
[4] G. Branwen, GPT-3 Nonfiction, 2020. URL: https://www.gwern.net/GPT-3-nonfiction.
[5] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language Models are Few-Shot Learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[6] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le, Finetuned Language Models Are Zero-Shot Learners, CoRR (2021). doi:10.48550/arXiv.2109.01652.
[7] E. Perez, D. Kiela, K. Cho, True Few-Shot Learning with Language Models, in: M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 11054–11070. URL: https://proceedings.neurips.cc/paper/2021/file/5c04925674920eb58467fb52ce4ef728-Paper.pdf.
[8] T. Schick, H. Schütze, True Few-Shot Learning with Prompts—A Real-World Perspective, Transactions of the Association for Computational Linguistics 10 (2022) 716–731. doi:10.1162/tacl_a_00485.
[9] R. L. Logan, I. Balazevic, E. Wallace, F. Petroni, S. Singh, S. Riedel, Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models, CoRR (2021). doi:10.48550/arXiv.2106.13353.
[10] S. Wang, H. Fang, M. Khabsa, H. Mao, H. Ma, Entailment as Few-Shot Learner, CoRR (2021). doi:10.48550/arXiv.2104.14690.
[11] R. El Baff, H. Wachsmuth, K. Al Khatib, M. Stede, B. Stein, Computational Argumentation Synthesis as a Language Modeling Task, in: Proceedings of the 12th International Conference on Natural Language Generation, Association for Computational Linguistics, Tokyo, Japan, 2019, pp. 54–64. doi:10.18653/v1/W19-8607.
[12] P. Clark, O. Tafjord, K. Richardson, Transformers as Soft Reasoners over Language, CoRR (2020). doi:10.48550/arXiv.2002.05867.
[13] T. Gurcke, M. Alshomary, H. Wachsmuth, Assessing the Sufficiency of Arguments through Conclusion Generation, CoRR (2021). doi:10.48550/arXiv.2110.13495.
[14] G. Skitalinskaya, J. Klaff, H. Wachsmuth, Learning From Revisions: Quality Assessment of Claims in Argumentation at Scale, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 1718–1729. doi:10.18653/v1/2021.eacl-main.147.
[15] Z. Jin, A. Lalwani, T. Vaidhya, X. Shen, Y. Ding, Z. Lyu, M. Sachan, R. Mihalcea, B. Schölkopf, Logical Fallacy Detection, 2022. doi:10.48550/arXiv.2202.13758.
[16] Ought Inc., Elicit, 2022. URL: https://elicit.org/, accessed: 2022-04-29.
[17] N. Slonim, Project Debator, in: Computational Models of Argument: Proceedings of COMMA 2018, 2018, p. 4. doi:10.3233/978-1-61499-906-5-4.
[18] N. Slonim, Y. Bilu, C. Alzate, R. Bar-Haim, B. Bogin, F. Bonin, L. Choshen, E. Cohen-Karlik, L. Dankin, L. Edelstein, L. Ein-Dor, R. Friedman-Melamed, A. Gavron, A. Gera, M. Gleize, S. Gretz, D. Gutfreund, A. Halfon, D. Hershcovich, R. Hoory, Y. Hou, S. Hummel, M. Jacovi, C. Jochim, Y. Kantor, Y. Katz, D. Konopnicki, Z. Kons, L. Kotlerman, D. Krieger, D. Lahav, T. Lavee, R. Levy, N. Liberman, Y. Mass, A. Menczel, S. Mirkin, G. Moshkowich, S. Ofek-Koifman, M. Orbach, E. Rabinovich, R. Rinott, S. Shechtman, D. Sheinwald, E. Shnarch, I. Shnayderman, A. Soffer, A. Spector, B. Sznajder, A. Toledo, O. Toledo-Ronen, E. Venezian, R. Aharonov, An Autonomous Debating System, Nature 591 (2021) 379–384. doi:10.1038/s41586-021-03215-w.
[19] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, N. Fiedel, PaLM: Scaling Language Modeling with Pathways, 2022. doi:10.48550/arXiv.2204.02311.
[20] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving Language Understanding by Generative Pre-Training, Technical Report, OpenAI, 2018.
[21] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language Models Are Unsupervised Multitask Learners, Technical Report, OpenAI, 2019. URL: http://www.persagen.com/files/misc/radford2019language.pdf.
[22] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, S. Presser, C. Leahy, The Pile: An 800GB Dataset of Diverse Text for Language Modeling, CoRR (2021). doi:10.48550/arXiv.2101.00027.
[23] K. Marriott, P. Sbarski, T. van Gelder, D. Prager, A. Bulka, Hi-Trees and Their Layout, IEEE Transactions on Visualization and Computer Graphics 17 (2011) 290–304. doi:10.1109/TVCG.2010.45.
[24] T. van Gelder, P. Monk, Argument Mapping Short Course, 2017.
[25] M. Lenz, P. Sahitaj, S. Kallenberg, C. Coors, L. Dumani, R. Schenkel, R. Bergmann, Towards an Argument Mining Pipeline Transforming Texts to Argument Graphs, in: Computational Models of Argument: Proceedings of COMMA 2020, 2020, pp. 263–270. doi:10.3233/FAIA200510.
[26] T. Gao, Prompting: Better Ways of Using Language Models for NLP Tasks, The Gradient (2021). URL: https://thegradient.pub/prompting/.
[27] T. L. Scao, A. M. Rush, How Many Data Points is a Prompt Worth?, CoRR (2021). doi:10.48550/arXiv.2103.08493.
[28] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, W. Chen, What Makes Good In-Context Examples for GPT-3?, CoRR (2021). doi:10.48550/arXiv.2101.06804.
[29] Y. Lu, M. Bartolo, A. Moore, S. Riedel, P. Stenetorp, Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity, CoRR (2021). doi:10.48550/arXiv.2104.08786.
[30] R. Zhong, K. Lee, Z. Zhang, D. Klein, Meta-tuning Language Models to Answer Prompts Better, CoRR (2021). doi:10.48550/arXiv.2104.04670.
[31] B. Lester, R. Al-Rfou, N. Constant, The Power of Scale for Parameter-Efficient Prompt Tuning, CoRR (2021). doi:10.48550/arXiv.2104.08691.
[32] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, J. Tang, GPT Understands, Too, CoRR (2021). doi:10.48550/arXiv.2103.10385.
[33] G. Qin, J. Eisner, Learning How to Ask: Querying LMs with Mixtures of Soft Prompts, CoRR (2021). doi:10.48550/arXiv.2104.06599.
[34] E. B. Zaken, S. Ravfogel, Y. Goldberg, BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models, CoRR (2021). doi:10.48550/arXiv.2106.10199.
[35] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An Imperative Style, High-Performance Deep Learning Library, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035.
[36] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art Natural Language Processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45.
[37] K. Parker, soft-prompt-tuning, 2021. URL: https://github.com/kipgparker/soft-prompt-tuning, accessed: 2021-11-20.
[38] J. Rasley, S. Rajbhandari, O. Ruwase, Y. He, DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, New York, NY, USA, 2020, pp. 3505–3506.
[39] W. Fedus, B. Zoph, N. Shazeer, Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, Journal of Machine Learning Research 23 (2022) 1–39. URL: http://jmlr.org/papers/v23/21-0998.html.