Good Premises Retrieval via a Two-Stage Argument Retrieval Model

Lorik Dumani
Trier University
dumani@uni-trier.de

31st GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 11.06.2019 - 14.06.2019, Saarburg, Germany. Copyright is held by the author/owner(s).

ABSTRACT
Computational argumentation is an emerging research area. An argument consists of a claim that is supported or attacked by at least one premise. Its intention is the persuasion of others to a certain standpoint. An important problem in this field is the retrieval of good premises for a given claim from a corpus of arguments. Given a claim, a first step of existing approaches is often to find other claims that are textually similar; then, the similar claims' premises can be retrieved. This paper presents a research plan for the implementation of a two-stage argument retrieval model that first finds similar claims for a given query claim and then, in a second step, retrieves clusters of similar premises in a ranked order.

1. INTRODUCTION
Argumentation has probably existed for as long as humans communicate, but research on computational argumentation has only recently become popular. In its simplest case, an argument consists of a claim or a standpoint that is supported or attacked by at least one premise [10]. These relations between claims and premises can be expressed by argument graphs. The purpose of argumentation is the persuasion of others towards a certain standpoint. Since premises can in turn be attacked or supported, often large argument networks emerge for a major claim [10].

Our ultimate goal is to support users arguing for or against a topic by providing the best premises to similar topics, ranked by convincingness, trustworthiness, or user context. There already exist argument search engines like args¹ or ArgumenText² that take a claim as input and return a list of premises that support or attack the query claim. These systems usually work on precomputed argument graphs that were either mined from texts or extracted from dedicated argument websites like idebate.org or debatewise.org. One challenge in premise retrieval is the small textual overlap between a query claim and good premises supporting or attacking it. In this paper we present a two-stage argument retrieval model. In contrast to existing methods like [15], which often use a combination of claim and premise as a retrieval unit, we argue that a more promising and principled approach than directly querying for premises is a two-stage process that first retrieves, given a query claim, matching claims from the argument collection, and then considers their premises only. Moreover, instead of retrieving single premises, we aim to cluster similar premises and to retrieve ranked clusters of premises.

For the remainder of this paper, Section 2 provides an overview of fundamentals such as an introduction to the related project ReCAP and the common definition of arguments and argumentation. In Section 3 we present our research plan to retrieve clusters of premises for a query claim. Section 4 describes our evaluation plan and Section 5 presents preliminary results. Section 6 provides an overview of related work and Section 7 concludes the paper with future work.

¹ www.args.me
² www.argumentsearch.com

2. FUNDAMENTALS
This section introduces this work's related project ReCAP as well as the common definition of arguments and argumentation.

2.1 Project Context
This work is part of the ReCAP project described in [1], which is part of the DFG priority program Robust Argumentation Machines (RATIO)³. ReCAP is an acronym for Information Retrieval and Case-Based Reasoning for Robust Deliberation and Synthesis of Arguments in the Political Discourse.

³ www.spp-ratio.de
The ReCAP project follows the vision of future argumentation machines that support researchers, journalistic writers, as well as human decision makers in obtaining a comprehensive overview of current arguments and opinions related to a certain topic. Furthermore, it aims to support the development of personal and well-founded opinions that are justified by convincing arguments. While existing search engines are too limited to achieve this, since they primarily operate on the textual level, such argumentation machines will reason on the knowledge level formed by arguments and argumentation structures. In [1] we propose a general architecture for an argumentation machine with a focus on novel contributions to and the confluence of methods from Information Retrieval (IR) and Knowledge Representation and Reasoning (KR), in particular Case-Based Reasoning. Deliberation finds and weighs all arguments supporting or opposing some question or topic based on the available knowledge, e.g. by assessing their strength or factual correctness, to enable informed decision making, e.g. for a political action. Synthesis tries to generate new arguments for an upcoming topic by transferring an existing relevant argument to the new topic and adapting it to the new environment.

This paper contributes to the retrieval of arguments, more specifically to the retrieval of clusters of the best premises, in a ranked order, for a given query claim from a corpus of arguments.

2.2 Argumentation
Argumentation is omnipresent and has probably existed for as long as humans communicate with each other; it was already studied by Aristotle more than 2,300 years ago [6]. By definition, an argument consists of a claim or standpoint supported or opposed by reasons or premises [10]. The terms claim and premise can be subsumed under the term argument units [3].

[Figure 1: Simple argument graph showing the relations between argument units. The claim C "We should build new nuclear power plants" is supported by p1 "Nuclear energy will reduce oil dependency" and attacked by p3 "Building nuclear plants endangers the environment"; p2 "Expert E states that nuclear energy will reduce oil dependency" supports p1.]

As shown in Figure 1, relations between claims and premises can be expressed by argument graphs. The main claim in a graph is called major claim [13], and since premises can in turn be attacked or supported, often large argument networks emerge for a major claim [10]. As Figure 1 suggests, an argument unit such as p1 can also be used as a premise to support another claim.

In this example the premises support or attack the claim, but the kind of support or attack is not further specified. However, supports can be specified with so-called inference schemes [17]. Such schemes are templates for argumentation consisting of claims and premises that are enriched with descriptors assigning different roles to the argument components, which eases the choice of the correct scheme. Following [17], the support for the inference p1 → C in this example can be specified as "positive consequence". The descriptor for the premise in this scheme is "If A is brought about, good consequences will plausibly occur". We can interpret a reduction in oil dependency as a good consequence. The descriptor for the claim in this scheme is "A should be brought about". The variable A in the descriptor can be replaced with the demand to build new nuclear plants. In contrast to supporting relations, there is no standard for the specification of attacking relations in argumentation theory yet.

Wachsmuth et al. provide in [16] a collection of approaches from the literature to measure argument quality in natural language. Furthermore, they define a taxonomy of dimensions to measure. Argument quality can be divided into the three dimensions of logical quality in terms of the cogency or strength of an argument, rhetorical quality in terms of the persuasive effect of an argument or argumentation, and dialectical quality in terms of the reasonableness of argumentation for resolving issues [16].
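To make the graph structure concrete, the following is a minimal sketch (not part of the ReCAP implementation; the class and field names are our own choices) of how the argument graph from Figure 1 could be represented, with support and attack edges between argument units:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ArgumentGraph:
    # id -> text of the argument unit (claim or premise)
    units: Dict[str, str] = field(default_factory=dict)
    # (source id, target id, relation), relation is "support" or "attack"
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def premises_of(self, claim_id: str) -> List[Tuple[str, str]]:
        """All argument units directly supporting or attacking the given claim."""
        return [(src, rel) for src, tgt, rel in self.edges if tgt == claim_id]

# The argument graph from Figure 1 as data
graph = ArgumentGraph(
    units={
        "C":  "We should build new nuclear power plants",
        "p1": "Nuclear energy will reduce oil dependency",
        "p2": "Expert E states that nuclear energy will reduce oil dependency",
        "p3": "Building nuclear plants endangers the environment",
    },
    edges=[("p1", "C", "support"), ("p3", "C", "attack"), ("p2", "p1", "support")],
)
print(graph.premises_of("C"))  # [('p1', 'support'), ('p3', 'attack')]
```

Note that p2 only reaches C indirectly via p1; the transitivity of such inferences is discussed in Section 3.4.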
3. RESEARCH PLAN
This section illustrates the research plan for implementing the two-stage retrieval system. We explain the necessity of the two stages and the challenges we expect.

3.1 Two-stage Retrieval Process
Our ultimate goal is the retrieval of good premises supporting and attacking a given query claim or, more generally, related to a query topic. Such a query could be a full sentence like "Find arguments to abandon nuclear energy" or just consist of relevant terms such as "abandon nuclear energy".

One major challenge in the retrieval of premises is that a good, convincing, and related premise does not necessarily have much textual overlap with the query. This can be illustrated with the premise "wind and solar energy can already provide most of the energy we need" for the query claims above. A less good premise could be "I don't like nuclear energy. I would abandon it". Evidently, the former premise only overlaps with the query in the rather general term "energy", but it is more convincing than the latter premise, which however overlaps in the three words "abandon", "nuclear", and "energy".

Since arguments consist of claims and premises, the premises are directly tied to their claim, so we can tackle this problem by using a two-stage retrieval process that first retrieves, given a query claim, matching claims from the argument collection, and then considers their premises only. In the first step we only search for claims similar to the user's query claim, i.e., ignoring the premises at this point. Then, in the second step, we cluster similar premises and retrieve them in a ranked order.

3.2 The First Stage
In order to find relevant claims for a query claim, we need to find claims that are semantically similar to the query claim. More precisely, we need to find claims that have premises relevant to the query. The challenge is thus to use essentially syntactic similarity to achieve semantic similarity. In order to estimate the probability that a claim is relevant to the query, we can use any similarity measure suitable for textual data, such as a plain language model, possibly with additional smoothing and taking the textual context of the claim into account.
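As an illustration of the first stage, the following sketch scores claims for a query with a smoothed unigram language model (query likelihood). It is only one of many possible similarity measures; the Jelinek-Mercer smoothing and the value of lam are assumptions on our part, not choices prescribed by this work:

```python
import math
from collections import Counter
from typing import List

def tokenize(text: str) -> List[str]:
    return text.lower().split()

def query_likelihood(query: str, claim: str, collection: List[str], lam: float = 0.1) -> float:
    """Log-probability of the query under a Jelinek-Mercer-smoothed
    unigram language model of the claim (higher means more similar)."""
    claim_tf = Counter(tokenize(claim))
    claim_len = sum(claim_tf.values())
    coll_tf = Counter(w for c in collection for w in tokenize(c))
    coll_len = sum(coll_tf.values())
    score = 0.0
    for w in tokenize(query):
        p_claim = claim_tf[w] / claim_len if claim_len else 0.0
        p_coll = coll_tf[w] / coll_len if coll_len else 0.0
        p = (1 - lam) * p_claim + lam * p_coll
        score += math.log(p) if p > 0 else math.log(1e-12)  # floor for unseen terms
    return score

claims = ["We should abandon nuclear energy", "Nuclear energy will reduce oil dependency"]
ranked = sorted(claims, key=lambda c: query_likelihood("abandon nuclear energy", c, claims),
                reverse=True)
```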
3.3 The Second Stage
Since we are searching for good premises for a query claim that are obtained from claims similar to the query claim, we can assume that similar claims often have similar premises. Furthermore, as we are working with a large corpus of arguments, we will find a lot of similar premises, probably from semantically completely different claims. So instead of searching for single premises, we group similar premises and search for clusters of premises. For clustering all premises, we can first convert all premises with the same stance into embedding vectors and then perform a hierarchical clustering. Instead of computing our own models to obtain embedding vectors, we can make use of existing models such as the Universal Sentence Encoder described in [2]. We can use the Euclidean distance to compute distances between vectors. Clustering can be accomplished with agglomerative clustering, which is a bottom-up approach. Since we prefer smaller clusters to keep the number of false positives per cluster to a minimum, complete linkage is a good way to connect clusters [9].
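A minimal sketch of this clustering step, assuming the premise embeddings (e.g. from the Universal Sentence Encoder [2]) are already available as a matrix; the distance threshold used to cut the dendrogram is a placeholder, not a value fixed by this work:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_premises(embeddings: np.ndarray, max_distance: float = 1.0) -> np.ndarray:
    """Agglomerative (bottom-up) clustering with complete linkage on
    Euclidean distances; returns a cluster label for each premise."""
    condensed = pdist(embeddings, metric="euclidean")
    dendrogram = linkage(condensed, method="complete")
    return fcluster(dendrogram, t=max_distance, criterion="distance")

def representative(embeddings: np.ndarray, labels: np.ndarray, cluster_id: int) -> int:
    """Index of the premise closest to its cluster's centroid, which can
    serve as the cluster representative (cf. Section 3.4)."""
    members = np.where(labels == cluster_id)[0]
    centroid = embeddings[members].mean(axis=0)
    return int(members[np.argmin(np.linalg.norm(embeddings[members] - centroid, axis=1))])

premise_vectors = np.random.rand(10, 512)   # stand-in for real sentence embeddings
labels = cluster_premises(premise_vectors, max_distance=0.8)
```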
[Figure 2: From a query to similar claims to clusters of premises. A query claim is matched against similar result claims (Result Claim 1-3), whose premises are grouped into premise clusters (Premise Cluster 1-3).]

Figure 2 visualizes an example of the relation between a query, similar claims, and clusters of similar premises. Here, we have to answer the research question of how often premises that are similar to a given premise appear with claims that are similar to the query claim. In order to estimate the probability that a premise cluster should be chosen as supportive for a claim, we can use a simple frequency-style approach, i.e., count how frequently a premise cluster of this claim supports similar claims in a large corpus. Besides that, we can also consider inverse document frequency-style evidence, i.e., count how frequently the premise cluster was used as support or attack for other claims in a large corpus. Other legitimate approaches are to include estimates of truthfulness, appropriateness (of the premise for the claim), and confidence in the expert. The ranking can incorporate factual correctness and convincingness, but also user context such as prior knowledge or belief in expert opinions, assumptions, and preferences. Therefore, we will include quality measures such as those described in [16]. However, we still need to investigate the strength of a cluster of premises; so far, there are only a few works in early stages of development concerning the quality of single premises [16], but none concerning clusters of premises.

3.4 Further Challenges
Another problem that requires attention is the premise's stance, i.e., whether the premise supports or attacks the claim. But the claim's stance also needs to be determined. Consider e.g. the query claim "Nuclear energy should be abolished" and the claim "Nuclear energy should not be abolished". These claims take different views but have a high textual similarity, which is why many retrieval methods would probably output a high similarity score. Still, the premises cannot be adopted automatically. Moreover, claims often do not have a stance if they are queries like "Should nuclear energy be extended?" or consist only of terms like "Nuclear energy". One legitimate possibility for claims with neutral stances is to treat them as implicitly positive. Then, if a query claim and a result claim have the same stance, a premise that supports the result claim also supports the query claim, whereas if the query claim and the result claim have opposite stances, a premise that supports the result claim will attack the query claim, and vice versa. Another approach that could make sense is to normalize the stances of claims, i.e., to try to have only "positive" claims; alternatively, we could reverse support and attack for negative claims. Still, that could be difficult if the stance is not fully clear. Nevertheless, there exist algorithms for stance detection [12] which we can use for this purpose. A small sketch of this stance-mapping rule is shown at the end of this section.

Consider Figure 1 again. As already stated, a claim can be used as a premise to support or attack another claim. In this instance, the premise p2 "Expert E states that nuclear energy will reduce oil dependency" is used to support the argument unit p1 "Nuclear energy will reduce oil dependency", which in turn is used as a premise to support the claim C "We should build new nuclear power plants". We need to investigate the transitivity of such inferences, i.e., in the example in Figure 1, to which extent e.g. p2 is supportive for C. Analogously, we need to investigate whether a premise is supportive for a claim if the premise attacks another premise that in turn attacks the claim. Assume there were a premise p4 "Humans endanger the environment either way" that attacks premise p3 "Building nuclear plants endangers the environment", which in turn attacks claim C in Figure 1. We want to examine how supportive premises such as p4 generally are for a claim. In [15], Wachsmuth et al. simply adopt such premises as the claim's own premises. However, we will investigate whether a partial score or a damping factor yields better results. Since we are working with clusters of premises, we can select one premise as representative; this could, for example, be the premise most similar to the centroid of the cluster. Recall that premises are converted to embedding vectors to compute the clusters.

So far we have considered less complex queries such as "what are good reasons for nuclear energy". A query, however, can be much more complicated, e.g. through the use of constraints. Such a more complex query could be "what are common statements with factual evidence of Expert E in the last three months that nuclear energy is a viable option in Germany". In this example a user demands factual evidence of a certain expert for a certain topic, restricted to a geographic area and a certain time span. Furthermore, the context could be restricted to opinions by certain interest groups or parties with a certain political orientation, such as left-wing parties. An approach could be to divide complex queries into sub-queries. If the query is expressed as a coherent sentence, its parse tree can be derived with part-of-speech tagging implementations such as [14]; a cut through this tree can then deliver useful sub-queries.
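The following is a small sketch (our own illustration, with hypothetical stance labels) of the stance-mapping rule described above: neutral claims are treated as implicitly positive, the premise's relation carries over if query and result claim share a stance, and is flipped otherwise.

```python
def map_premise_relation(relation: str, query_stance: str, result_stance: str) -> str:
    """relation: "support" or "attack" of the premise towards the result claim.
    Stances: "pro", "con", or "neutral"; neutral is treated as implicitly positive."""
    normalize = lambda stance: "pro" if stance == "neutral" else stance
    if normalize(query_stance) == normalize(result_stance):
        return relation                                       # same stance: relation carries over
    return "attack" if relation == "support" else "support"   # opposite stance: flip

# A premise supporting "Nuclear energy should not be abolished" attacks
# the query claim "Nuclear energy should be abolished":
print(map_premise_relation("support", query_stance="pro", result_stance="con"))  # attack
```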
4. EVALUATION PLAN
Instead of creating argument collections manually, which is a very time-consuming task, or mining arguments automatically from natural language texts, which might be noisy, we adapt the idea of [15] and make use of several debate portals. In fact, we use idebate.org, debatewise.org, debatepedia.org, and debate.org as a starting point. While the first three are of high quality, the last is of lower quality, i.e., a few premises consist of insults or nonsense. However, it contains many more debates than the other three together. We expect this constellation to result in good diversification. The structure of debate portals already provides argument structures: a questioner asks the community about a topic, e.g., "Should we build new nuclear power plants", and users of the community can directly answer the question and substantiate their posts, e.g. with facts or examples. Many debate portals also provide the possibility of adding a stance for or against an answer, as do the portals we have selected for our study. The main advantages of debate portals are that the posts are not artificial but close to reality; besides that, they are coherent. Following [15], we use the debate portals' questions as claims and their answers as premises to build arguments.

We can divide the evaluation of the two-stage retrieval process into two evaluation steps. First we want to find claims similar to a query claim. This can be achieved via an existing textual similarity method. In order to decide which similarity method is suitable, we can take a small number n of query claims and build pools of depth k as the union of the result claims of existing similarity methods. Then, annotators can manually assess the similarity of each (query claim, result claim) pair, e.g. on a scale between 1 (nothing in common) and 5 (semantically equal). The question which similarity method should be adopted for the retrieval of claims can then be shifted to the question which method's ranking comes closest to the annotations. We will use state-of-the-art ranking measures such as nDCG [8] for the evaluation of rankings.

After we have determined the most similar claims for a query claim, we want to retrieve their directly tied (clusters of) premises. In order to validate the hypothesis that claims that are highly similar to the query claim also have premises that are highly relevant for the query claim, we can take a fixed number of (query claim, result claim, result claim premise) triples of different similarities and let annotators manually assess them, e.g. on a binary scale, where the annotators are not aware of the actual result claim. The higher the similarity between two claims, the more relevant the one claim's premises should be to the other claim.

Furthermore, we need an end-to-end analysis to evaluate the overall performance of our premise retrieval approach, i.e., how well our approach can retrieve premises for a given claim. For a subset of our query claims, we will build a pool of all result premises in the top-k (for some k ∈ ℕ) of all result lists and let annotators assess the premises' relevance as explained above. In addition, we can conduct a user study with more participants to overcome possible shortcomings of having only a few annotators check the results. By using nDCG at different cutoffs, averaged over all queries, we can evaluate different retrieval methods for this end-to-end analysis.
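To make the pooling and ranking evaluation concrete, here is a minimal sketch (our own illustration; variable names and the log2-discount variant of nDCG are assumptions) of building a pool of depth k from several methods' result lists and computing nDCG@k against graded assessments:

```python
import math
from typing import Dict, List

def build_pool(result_lists: Dict[str, List[str]], depth: int) -> set:
    """Union of the top-`depth` result claims over all retrieval methods."""
    pool = set()
    for ranking in result_lists.values():
        pool.update(ranking[:depth])
    return pool

def dcg(gains: List[float], k: int) -> float:
    # Discounted cumulative gain at cutoff k with a log2(rank + 1) discount.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(ranking: List[str], assessments: Dict[str, float], k: int) -> float:
    """nDCG@k of one method's ranking, given graded relevance assessments."""
    gains = [assessments.get(claim_id, 0.0) for claim_id in ranking]
    ideal = sorted(assessments.values(), reverse=True)
    return dcg(gains, k) / dcg(ideal, k) if dcg(ideal, k) > 0 else 0.0

runs = {"bm25": ["c3", "c1", "c7"], "dfr": ["c1", "c3", "c2"]}   # hypothetical result lists
qrels = {"c1": 4.0, "c2": 3.0, "c3": 1.0}                        # hypothetical assessments
pool = build_pool(runs, depth=2)                                 # claims to be assessed by annotators
print({method: ndcg(ranking, qrels, k=2) for method, ranking in runs.items()})
```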
5. PRELIMINARY RESULTS
In this section we give an overview of the results we have found so far by investigating the stages of the two-stage retrieval model. First we describe how we built our dataset of arguments, then we describe the first and the second step of the two-step retrieval process.

The dataset described in [15] is not publicly available; therefore, we reconstructed a similar dataset following the approach in that paper. We crawled the arguments from four debate portals, namely debate.org, debatepedia.org, debatewise.org, and idebate.org. After the arguments were extracted, they were indexed with Apache Lucene. In the end, this resulted in 59,126 claims with 695,818 premises overall, so on average about 11.8 premises per claim.

We now describe the first step of the two-step retrieval process. Since real-life query inputs of users are difficult to obtain, we drew a random sample of 233 claims and used them as queries. In order to avoid claims that address completely random topics, our sample contained only claims that are related to the topic "energy". To do so, we trained a word embedding model on the 59,126 claims of our corpus using DeepLearning4j⁴. Then, we retrieved the words nearest to the word energy and filtered out inappropriate suggestions, i.e., those that had nothing in common with our topic energy in the broadest sense. We repeated this step five times for all newly added suggestions. In the end, we obtained 44 words such as "nuclear", "electricity", "wind", "solar", "oil", "emission", etc. We got 1,529 candidate claims in which at least one of these words occurred, from which we drew a random sample of 233 claims, making sure by manual inspection that they are really related to the topic energy. To ensure that we end up with at least 200 valid claims, we added another 33. In the end, we removed one claim because it appeared twice. We considered 196 different retrieval methods⁵ implemented in Apache Lucene and retrieved, for each method, result claims for our 232 query claims. From the results, we built pools of depth 5, i.e., including any claim that appeared in the result list of any method at rank 5 or better. This resulted in 5,171 (query claim, result claim) pairs; note that pairs where the result claim was equal to the query claim were already excluded.

⁴ Among others, we used SkipGram as the learning algorithm, the maximum window size was 8, the word vector size was 1000, the text was not preprocessed, and the number of iterations over the whole corpus was 15.
⁵ Apache Lucene (version 7.6.0) provides 139 similarity methods as well as a class for combining multiple similarities. We tested all combinations of the best methods' variants of Divergence from Randomness, Divergence from Independence, information-based models, and Axiomatic approaches as well as BM25 and Jelinek-Mercer in a first run and got ∑_{k=2}^{6} (6 choose k) = 57 new methods, resulting in 196 methods.
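The word embedding model above was trained with DeepLearning4j; purely as an illustration, the following sketch reproduces the neighbourhood-expansion idea with gensim (a stand-in, not the code used in this work), using the hyperparameters from footnote 4. The toy claim list is a placeholder for the 59,126-claim corpus:

```python
from gensim.models import Word2Vec

claims = [
    "We should build new nuclear power plants",
    "Nuclear energy will reduce oil dependency",
]  # placeholder; the real corpus contains 59,126 claims
tokenized_claims = [claim.lower().split() for claim in claims]

# SkipGram (sg=1), window 8, vector size 1000, 15 epochs, no preprocessing (cf. footnote 4)
model = Word2Vec(sentences=tokenized_claims, sg=1, window=8,
                 vector_size=1000, epochs=15, min_count=1)

seeds = {"energy"}
for _ in range(5):                      # five expansion rounds
    new_words = set()
    for word in seeds:
        for neighbour, _score in model.wv.most_similar(word, topn=10):
            new_words.add(neighbour)
    # In the paper, suggestions unrelated to the topic were filtered out manually here.
    seeds |= new_words
print(sorted(seeds))
```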
The user-perceived similarity of each (query claim, result claim) pair was independently assessed by at least two annotators on a scale from 1 to 5. A total of eight people participated in the annotation; they are all part of the ReCAP project and were introduced to the basics of argumentation theory. Table 1 explains the meanings of the different levels. The underlying assumption of this scale is that all premises of claims rated 4 or 5 should apply to the query claim, whereas no premises of claims rated 1 should apply; for claims rated 3, we expect that a good number of premises match, whereas premises of claims rated 2 would only rarely match. The annotators were confronted with the query claim and a result claim and were asked to assess how well they expect the premises of the result claim (which were unknown to them) would match the query claim. Since we only wanted to measure the relevance of claims at this point, the actual premises were not considered here but investigated later. Since the polarity of premises is not in the focus of this study, we collapse the levels 4 and 5 into a single level 4. As every pair of query claim and result claim was assessed by at least two annotators, the final relevance value of a result claim for a query claim was computed as the mean of the corresponding assessments.

Table 1: Relevance levels for claim assessment
score  meaning
5      The claims are equal.
4      The claims differ in polarity, but are otherwise equal.
3      The claims differ in specificity or extent.
2      The claims address the same topic, but are otherwise unrelated.
1      The claims are unrelated.

Using the assessed pool of results as a gold standard, we evaluated the performance of the 196 retrieval methods under consideration for the claim retrieval task, using nDCG@k [8] with cutoff values k ∈ {1, 2, 5} as quality metric. Our results clearly show that the BM25 [11] scoring method used in previous works is usually not a good choice, especially for cutoff 5, which is a realistic cutoff for a system that aims at finding the top-10 premises. In contrast to the Divergence from Randomness (DFR) method [7], which yielded an nDCG@5 of 0.7982, BM25 yielded only 0.7616.

We now focus on the second step of the two-stage retrieval framework, retrieving the premises of claims similar to the query claim. Our goal here is to verify the assumption made above that claims highly similar to the query claim also have premises that are highly relevant for the query claim. To systematically approach this question, we formed triples of the form (query claim, result claim, result premise) from the above-mentioned pool, where the result premise is a premise of the result claim. We grouped the triples according to the relevance of the result claim to the query claim, forming groups with the relevance ranges [n, n + 0.5) for n ∈ {1, 1.5, 2, 2.5, 3, 3.5} and the range [4, 4], which yielded seven groups.
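As a small illustration of this grouping step (our own sketch, with hypothetical variable names), before the sampling and assessment described next:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def group_by_claim_relevance(
    triples: List[Tuple[str, str, str]],
    claim_relevance: Dict[Tuple[str, str], float],
) -> Dict[float, List[Tuple[str, str, str]]]:
    """Bucket (query claim, result claim, result premise) triples by the mean
    relevance of the result claim: [n, n + 0.5) for n = 1, 1.5, ..., 3.5, plus [4, 4]."""
    groups = defaultdict(list)
    for query, claim, premise in triples:
        rel = claim_relevance[(query, claim)]
        lower_bound = 4.0 if rel >= 4.0 else int(rel * 2) / 2   # floor to the nearest 0.5
        groups[lower_bound].append((query, claim, premise))
    return groups
```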
Then, we randomly drew 100 (query claim, result claim, result premise) triples from each group and had two annotators manually assess the relevance of the result premise for the query claim (without seeing the result claim), resulting in 1,400 assessments. Annotators could choose between not relevant and relevant, the latter with three different stance options: query with neutral stance, premise with the same stance as the query, and premise with the opposite stance as the query. As we did with claims before, we ignore the stances of premises, since we only want to focus on their relevance and many claims of our dataset do not have a stance anyway; we thus consider only binary relevance for premises from now on. Our preliminary results support the observation that the more relevant a claim is for the query, the more relevant premises it yields. For example, 80% of the premises of result claims in the interval [4, 4] were relevant to the query claim, whereas only 6% of the premises in the interval [1, 1.5) were relevant to the query claim. So if a search engine performs well at the claim retrieval task, it should also perform well at the subsequent premise retrieval task; the initial hypothesis is thus validated.

6. RELATED WORK
Wachsmuth et al. [15] introduce one of the first prototypes of an argument search engine, called args. Their system operates on arguments crawled from debate portals. Given a user query, the system retrieves, ranks, and presents premises supporting and attacking the query claim, taking the similarity of the query claim with the premise, its corresponding claim, and other contextual information into account. They apply a standard BM25F ranking model implemented on top of Lucene. In contrast to their system, we did not restrict ourselves to BM25 or its variants, but evaluated 196 different similarity methods for claim retrieval.

Stab et al. [12] present ArgumenText, an argument retrieval system capable of retrieving topic-relevant sentential arguments from a large collection of diverse Web texts for any given controversial topic. The system first retrieves relevant documents, then identifies arguments and classifies them as "pro" or "con", and presents them ranked by relevance in a web interface. In their implementation, they make use of Elasticsearch and BM25 to retrieve the top-ranked documents. In contrast to this work, we do not consider the argument mining task, but assume that we operate on a collection of arguments with claims and premises. However, in another work, Habernal and Gurevych [4] propose a semi-supervised model for argumentation mining of user-generated Web content.

In [5], Habernal and Gurevych address the relevance of premises to estimate the convincingness of arguments using neural networks. Since relevance underlies subjective judgement, they first confronted users in a crowdsourced task with pairs of premises to decide which premise is more convincing, and then used a bidirectional LSTM to predict which argument is more convincing. Wachsmuth et al. [16] consider the problem of judging the relevance of arguments and provide an overview of the work on computational argumentation quality in natural language, including theories and approaches. Approaches that predict the relevance or convincingness of premises can be useful to rank premises.

7. CONCLUSION AND FUTURE WORK
Retrieving good premises for claims is an important but difficult problem for which no good solutions exist yet. This paper has provided some insights that a two-stage retrieval process that first retrieves claims and then ranks their clustered premises can be a step towards a solution. According to assessments by human annotators, the best premises are indeed found for the claims most similar to the query. We showed that, instead of exhaustively assessing all retrieved premises for a claim, it is sufficient to assess only the retrieved claims, which is an order of magnitude less work.

Our future work will include ranking methods for premises. We will also examine additional quality-based premise features [16] such as convincingness or correctness. We plan a public Web application as an interface to our premise retrieval system.

We will also tackle the task of detecting stances. Although debate portals ask users to add stances to the premises, these stances are related to the claim, while the claims' stances are not further specified. Hence, premises that support a claim may attack a claim with an opposite stance, and vice versa.

8. ACKNOWLEDGMENTS
I would like to thank my supervisor Ralf Schenkel for his invaluable help in creating this paper. This work has been funded by the Deutsche Forschungsgemeinschaft (DFG) within the project ReCAP, Grant Number 375342983 - 2018-2020, as part of the Priority Program "Robust Argumentation Machines (RATIO)" (SPP-1999).
9. REFERENCES
[1] R. Bergmann, R. Schenkel, L. Dumani, and S. Ollinger. ReCAP - Information retrieval and case-based reasoning for robust deliberation and synthesis of arguments in the political discourse. In Proceedings of the Conference "Lernen, Wissen, Daten, Analysen", LWDA 2018, Mannheim, Germany, August 22-24, 2018, pages 49–60, 2018.
[2] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, and R. Kurzweil. Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, pages 169–174, 2018.
[3] J. Eckle-Kohler, R. Kluge, and I. Gurevych. On the role of discourse markers for discriminating claims and premises in argumentative discourse. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 2236–2242, 2015.
[4] I. Habernal and I. Gurevych. Exploiting debate portals for semi-supervised argumentation mining in user-generated web discourse. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 2127–2137, 2015.
[5] I. Habernal and I. Gurevych. Which argument is more convincing? Analyzing and predicting convincingness of web arguments using bidirectional LSTM. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016.
[6] I. Habernal, R. Hannemann, C. Pollak, C. Klamm, P. Pauli, and I. Gurevych. Argotario: Computational argumentation meets serious games. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017 - System Demonstrations, pages 7–12, 2017.
[7] S. P. Harter. A probabilistic approach to automatic keyword indexing. JASIS, 26(4):197–206, 1975.
[8] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422–446, 2002.
[9] G. N. Lance and W. T. Williams. Mixed-data classificatory programs I - Agglomerative systems. Australian Computer Journal, 1(1):15–20, 1967.
[10] A. Peldszus and M. Stede. From argument diagrams to argumentation mining in texts: A survey. IJCINI, 7(1):1–31, 2013.
[11] S. E. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
[12] C. Stab, J. Daxenberger, C. Stahlhut, T. Miller, B. Schiller, C. Tauchmann, S. Eger, and I. Gurevych. ArgumenText: Searching for arguments in heterogeneous sources. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 2-4, 2018, Demonstrations, pages 21–25, 2018.
[13] C. Stab, C. Kirschner, J. Eckle-Kohler, and I. Gurevych. Argumentation mining in persuasive essays and scientific articles from the discourse structure perspective. In Proceedings of the Workshop on Frontiers and Connections between Argumentation Theory and Natural Language Processing, Forlì-Cesena, Italy, July 21-25, 2014, 2014.
[14] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2003, Edmonton, Canada, May 27 - June 1, 2003, 2003.
[15] H. Wachsmuth, M. Potthast, K. A. Khatib, Y. Ajjour, J. Puschmann, J. Qu, J. Dorsch, V. Morari, J. Bevendorff, and B. Stein. Building an argument search engine for the web. In Proceedings of the 4th Workshop on Argument Mining, ArgMining@EMNLP 2017, Copenhagen, Denmark, September 8, 2017, pages 49–59, 2017.
[16] H. Wachsmuth, B. Stein, G. Hirst, V. Prabhakaran, Y. Bilu, Y. Hou, N. Naderi, and T. Alberdingk Thijm. Computational argumentation quality assessment in natural language. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, pages 176–187, 2017.
[17] D. Walton, C. Reed, and F. Macagno. Argumentation Schemes. Cambridge University Press, 2008.