         Good Premises Retrieval via a Two-Stage Argument
                        Retrieval Model

                                                            Lorik Dumani
                                                             Trier University
                                                       dumani@uni-trier.de


31st GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 11.06.2019 - 14.06.2019, Saarburg, Germany.
Copyright is held by the author/owner(s).

ABSTRACT

Computational argumentation is an emerging research area. An argument consists of a claim that is supported or attacked by at least one premise. Its intention is the persuasion of others to a certain standpoint. An important problem in this field is the retrieval of good premises for a given claim from a corpus of arguments. Given a claim, a first step of existing approaches is often to find other claims that are textually similar. Then, the similar claim's premises can be retrieved. This paper presents a research plan for an implementation of a two-stage argument retrieval model that first finds similar claims for a given query claim and then, in a second step, retrieves clusters of similar premises in a ranked order.

1. INTRODUCTION

Argumentation has probably existed for as long as humans have communicated, but research on computational argumentation has only recently become popular. In its simplest case an argument consists of a claim or a standpoint that is supported or attacked by at least one premise [10]. These relations between claims and premises can be expressed by argument graphs. The purpose of argumentation is the persuasion of others towards a certain standpoint. Since premises can in turn be attacked or supported, often large argument networks emerge for a major claim [10].

Our ultimate goal is to support users arguing for or against a topic by providing the best premises for similar topics in a ranked order by convincingness, trustworthiness, or user context. There already exist argument search engines like args (www.args.me) or ArgumenText (www.argumentsearch.com) that take a claim as input and return a list of premises that support or attack the query claim. These systems usually work on precomputed argument graphs that were either mined from texts or extracted from dedicated argument websites like idebate.org or debatewise.org. One challenge in premise retrieval is the small textual overlap between a query claim and good premises supporting or attacking it. In this paper we present a two-stage argument retrieval model. In contrast to existing methods like [15], which often use a combination of claim and premise as the retrieval unit, we argue that a more promising and principled approach than directly querying for premises is a two-stage process that first retrieves, given a query claim, matching claims from the argument collection, and then considers their premises only. Then, instead of retrieving single premises, we aim to cluster similar premises and to retrieve ranked clusters of premises.

For the remainder of this paper, Section 2 provides an overview of fundamentals such as an introduction to the related project ReCAP and the common definition of arguments and argumentation. In Section 3 we present our research plan to retrieve clusters of premises for a query claim. Section 4 describes our evaluation plan and Section 5 presents some preliminary results. Section 6 provides an overview of related work and Section 7 concludes the paper with future work.

2. FUNDAMENTALS

This section introduces this work's related project ReCAP as well as the common definition of arguments and argumentation.

2.1 Project Context

This work is part of the ReCAP project described in [1], which is part of the DFG priority program Robust Argumentation Machines (RATIO, www.spp-ratio.de).

ReCAP is an acronym for Information Retrieval and Case-Based Reasoning for Robust Deliberation and Synthesis of Arguments in the Political Discourse. The ReCAP project follows the vision of future argumentation machines that support researchers, journalistic writers, as well as human decision makers in obtaining a comprehensive overview of current arguments and opinions related to a certain topic. Furthermore, it aims to support the development of personal and well-founded opinions that are justified by convincing arguments. While existing search engines are of limited use for this purpose, since they primarily operate on the textual level, such argumentation machines will reason on the knowledge level formed by arguments and argumentation structures.
In [1] we propose a general architecture for an argumentation machine with a focus on novel contributions to and the confluence of methods from Information Retrieval (IR) and Knowledge Representation and Reasoning (KRR), in particular Case-Based Reasoning. Deliberation finds and weighs all arguments supporting or opposing some question or topic based on the available knowledge, e.g. by assessing their strength or factual correctness, to enable informed decision making, e.g. for a political action. Synthesis tries to generate new arguments for an upcoming topic based on transferring an existing relevant argument to the new topic and adapting it to the new environment.

This paper contributes to the retrieval of arguments, more specifically to the retrieval of clusters of the best premises in a ranked order for a given query claim from a corpus of arguments.

2.2 Argumentation

Argumentation is omnipresent and has probably existed for as long as humans have communicated with each other; argumentation was already studied by Aristotle more than 2,300 years ago [6]. By definition, an argument consists of a claim or standpoint supported or opposed by reasons or premises [10]. The terms claim and premise can be subsumed under the term argument units [3].

As shown in Figure 1, relations between claims and premises can be expressed by argument graphs. The main claim in a graph is called the major claim [13], and since premises can in turn be attacked or supported, often large argument networks emerge for a major claim [10]. As Figure 1 suggests, an argument unit such as p1 can also be used as a premise to support another claim.

[Figure 1: Simple argument graph showing the relations between argument units. The claim C "We should build new nuclear power plants" is supported by p1 "Nuclear energy will reduce oil dependency" and attacked by p3 "Building nuclear plants endangers the environment"; p1 is in turn supported by p2 "Expert E states that nuclear energy will reduce oil dependency".]

In this example the premises support or attack the claim, but the kind of support or attack is not further specified. However, supports can be specified with so-called inference schemes [17]. Those schemes are templates for argumentation that consist of claims and premises enriched with descriptors; the descriptors assign different roles to different argument components to ease the choice of the correct scheme. Following [17], the support for the inference p1 → C in this example can be specified as "positive consequence". The descriptor for the premise in this scheme is "If A is brought about, good consequences will plausibly occur". We can interpret a reduction in oil dependency as a good consequence. The descriptor for the claim in this scheme is "A should be brought about". The variable A in the descriptor can be replaced with the demand to build new nuclear plants. In contrast to supporting relations, there is no standard for the specification of attacking relations in argumentation theory yet.

Wachsmuth et al. provide in [16] a collection of approaches from the literature to measure argument quality in natural language. Furthermore, they define a taxonomy of dimensions to measure. The dimensions of argument quality can be divided into three: logical quality in terms of the cogency or strength of an argument, rhetorical quality in terms of the persuasive effect of an argument or argumentation, and dialectical quality in terms of the reasonableness of argumentation for resolving issues [16].

3. RESEARCH PLAN

This section illustrates the research plan for implementing the two-stage retrieval system. We explain the necessity of the two stages and the challenges we expect.

3.1 Two-stage Retrieval Process

Our ultimate goal is the retrieval of good premises supporting and attacking a given query claim or, more generally, related to a query topic. Such a query could be a full sentence like "Find arguments to abandon nuclear energy" or just consist of relevant terms such as "abandon nuclear energy". One major challenge in the retrieval of premises is that a good, convincing, and related premise does not necessarily have much textual overlap with the query. This can be illustrated with the premise "wind and solar energy can already provide most of the energy we need" for the above query claims. A less good premise could be "I don't like nuclear energy. I would abandon it". It is evident that the former premise only overlaps with the query in the rather general term "energy" but is more convincing than the latter premise, which however overlaps in the three words "abandon", "nuclear", and "energy".

Since arguments consist of claims and premises, the premises are directly tied to the claim, so we can tackle this problem by using a two-stage retrieval process that first retrieves, given a query claim, matching claims from the argument collection, and then considers their premises only. In the first step we only search for claims similar to the user's query claim, i.e., ignoring the premises at this point. Then, in the second step, we cluster similar premises and retrieve them in a ranked order.

3.2 The First Stage

In order to find relevant claims for a query claim we need to find claims that are semantically similar to the query claim. More precisely, we need to find claims that have premises relevant to the query. So the challenge is to use basically syntactic similarity to achieve semantic similarity. In order to estimate the probability that a claim is relevant to the query, we can use any similarity measure suited for textual data, such as a plain language model, possibly with additional smoothing and taking the textual context of the claim into account.
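To make the first stage concrete, the following minimal Python sketch scores candidate claims for a query claim with a unigram query-likelihood model and Jelinek-Mercer smoothing. This is only one possible instantiation of the similarity measures mentioned above (the experiments in Section 5 use the similarity methods shipped with Apache Lucene); the function name and the toy collection are purely illustrative:

import math
from collections import Counter

def score_claim(query_terms, claim_terms, collection_tf, collection_len, lam=0.1):
    # Query-likelihood score of a candidate claim under a unigram language
    # model with Jelinek-Mercer smoothing; lam weights the collection model.
    tf = Counter(claim_terms)
    doc_len = len(claim_terms)
    score = 0.0
    for term in query_terms:
        p_doc = tf[term] / doc_len if doc_len else 0.0
        p_coll = collection_tf.get(term, 0) / collection_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p > 0:  # query terms unseen anywhere in the collection are skipped
            score += math.log(p)
    return score

# Toy collection of two candidate claims for the query "abandon nuclear energy".
claims = {
    "c1": "nuclear energy should be abolished".split(),
    "c2": "we should build new coal power plants".split(),
}
all_terms = [t for terms in claims.values() for t in terms]
collection_tf, collection_len = Counter(all_terms), len(all_terms)
query = "abandon nuclear energy".split()
ranking = sorted(claims, reverse=True,
                 key=lambda c: score_claim(query, claims[c], collection_tf, collection_len))
print(ranking)  # "c1" is ranked above "c2"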
3.3 The Second Stage

Since we are searching for good premises for a query claim that are obtained from claims similar to the query claim, we can assume that similar claims often have similar premises. Furthermore, as we are working with a large corpus of arguments, we will find a lot of similar premises, probably from semantically completely different claims. So instead of searching for single premises, we group similar premises and search for clusters of premises. For clustering all premises, we can first convert all premises with the same stance into embedding vectors and then perform a hierarchical clustering. Instead of computing our own models to obtain embedding vectors, we can make use of existing models such as the Universal Sentence Encoder described in [2]. We can use the Euclidean distance to compute distances between vectors. Clustering can be accomplished with agglomerative clustering, which is a bottom-up approach. Since we prefer smaller clusters to keep the number of false positives per cluster to a minimum, complete linkage is a good way to connect clusters [9].
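The following minimal Python sketch illustrates this clustering step with SciPy, assuming premise embeddings are already available; random vectors stand in for real sentence embeddings (in practice they would come from the Universal Sentence Encoder [2]), and the distance threshold is an illustrative placeholder rather than a tuned value:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Stand-in for premise embeddings; in practice these would come from a
# sentence encoder such as the Universal Sentence Encoder [2].
rng = np.random.default_rng(0)
premise_embeddings = rng.normal(size=(8, 512))

# Pairwise Euclidean distances, then bottom-up (agglomerative) clustering
# with complete linkage, which favours small, tight clusters.
distances = pdist(premise_embeddings, metric="euclidean")
tree = linkage(distances, method="complete")

# Cut the dendrogram at an (illustrative) distance threshold to obtain
# flat cluster labels, one label per premise.
labels = fcluster(tree, t=30.0, criterion="distance")
print(labels)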
[Figure 2: From a query to similar claims to clusters of premises. The Query Claim is connected to similar Result Claims 1-3, whose premises form Premises Clusters 1-3.]

Figure 2 visualizes an example of the relation between a query, similar claims, and clusters of similar premises. Here, we have to answer the research question of how often premises that are similar to a given premise appear under claims that are similar to the query claim. In order to estimate the probability that a premise cluster should be chosen as supportive for a claim, we can use a simple frequency-style approach, i.e., we count how frequently a premise cluster of this claim supports similar claims in a large corpus. Besides that, we can also consider inverse document frequency-style evidence, i.e., we count how frequently the premise cluster was used as support or attack for other claims in a large corpus. Other legitimate approaches are to include estimates of truthfulness, appropriateness (of the premise for the claim), and confidence in experts. The ranking can incorporate factual correctness, convincingness, but also user context such as prior knowledge or belief in expert opinions, assumptions, and preferences. Therefore, we will include quality measures such as those described in [16]. However, we also need to investigate the strength of a cluster of premises. So far, there are only a few works in the early stages of development concerning the quality of single premises [16], but not of clusters of premises.
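A small, purely illustrative sketch of the frequency- and inverse-document-frequency-style evidence described above is given below; the way the two counts are combined is an assumption for illustration, not our final ranking formula:

import math

def cluster_score(support_count, claims_with_cluster, n_claims):
    # Illustrative score for a premise cluster: how often premises of the
    # cluster support claims similar to the query (frequency part), damped by
    # how widely the cluster is used across all claims (IDF-style part).
    tf_part = support_count
    idf_part = math.log(n_claims / (1 + claims_with_cluster))
    return tf_part * idf_part

# Toy numbers over a corpus of 59,126 claims: cluster A supports 12 similar
# claims but appears under 5,000 claims overall; cluster B supports 9 similar
# claims but is far more specific.
print(cluster_score(12, 5000, 59126))  # frequent but generic
print(cluster_score(9, 40, 59126))     # rarer, more specific, scores higher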
3.4 Further Challenges

Another problem that requires attention is the premise's stance, i.e., whether the premise supports or attacks the claim. But the claim's stance also needs to be determined. Consider e.g. the query claim "Nuclear energy should be abolished" and the claim "Nuclear energy should not be abolished". These claims take different views but have a high textual similarity, which is why many retrieval methods would probably output a high similarity. Still, the premises cannot be adopted automatically. Moreover, claims often do not have a stance if they are phrased as questions like "Should nuclear energy be extended?" or consist only of terms like "Nuclear energy". One legitimate possibility for claims with neutral stances is to treat them as implicitly positive. Then, if a query claim and a result claim have the same stance, a premise that supports the result claim also supports the query claim, whereas if the query claim and the result claim have opposite stances, a premise that supports the result claim will attack the query claim and vice versa. Another approach that could make sense is to normalize the stances of claims, i.e., to try to have only "positive" claims. Alternatively, we could invert support and attack for negative claims. Still, that could be difficult if the stance is not fully clear. Nevertheless, there exist algorithms for stance detection [12] which we can use for this purpose.
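The stance propagation rule sketched above can be made explicit as follows; the function and label names are illustrative, and a real implementation would rely on a stance detection component such as [12] to provide the stance labels:

def propagate_stance(premise_relation, query_stance, result_stance):
    # Map a premise's relation to the result claim ("supports"/"attacks")
    # onto the query claim, treating neutral stances as implicitly positive.
    query_stance = "pro" if query_stance == "neutral" else query_stance
    result_stance = "pro" if result_stance == "neutral" else result_stance
    if query_stance == result_stance:
        return premise_relation
    # Opposite stances flip the relation: a supporting premise becomes an attack.
    return "attacks" if premise_relation == "supports" else "supports"

# "Nuclear energy should be abolished" (query, pro) vs.
# "Nuclear energy should not be abolished" (result claim, con):
print(propagate_stance("supports", "pro", "con"))  # -> attacks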
Consider Figure 1 again. As already stated, a claim can be used as a premise to support or attack another claim. In this instance, the premise p2 "Expert E states that nuclear energy will reduce oil dependency" is used to support the argument unit p1 "Nuclear energy will reduce oil dependency", which in turn is used as a premise to support the claim C "We should build new nuclear power plants". We need to investigate the transitivity of inferences, i.e., in the example in Figure 1, to which extent e.g. p2 is supportive of C. Analogously, we need to investigate whether a premise is supportive of a claim if the premise attacks another premise that in turn attacks the claim. Assume there were a premise p4 "Humans endanger the environment either way" that attacks premise p3 "Building nuclear plants endangers the environment", which in turn attacks claim C in Figure 1. We want to examine how supportive premises such as p4 generally are to a claim. In [15] Wachsmuth et al. simply adopt these as the claim's own premises. However, we will investigate whether a partial score or a damping factor yields better results. Since we are working with clusters of premises, we can select one premise as a representative. This could, for example, be the premise most similar to the centroid of the cluster. Recall that premises are converted to embedding vectors to compute the clusters.

So far we have considered less complex queries such as "what are good reasons for nuclear energy". A query, however, can be much more complicated, e.g. through the use of constraints. Such a more complex query could be "what are common statements with factual evidence of Expert E in the last three months that nuclear energy is a viable option in Germany". In this example a user demands factual evidence for a geographically restricted area from a certain expert for a certain topic in a certain time span. Furthermore, the context could be restricted to opinions by certain interest groups or parties with a certain political orientation such as left-wing parties. An approach could be to divide complex queries into sub-queries. If the query is expressed as a coherent sentence, its parse tree can be derived by the use of part-of-speech implementations such as [14]. Then, a cut through the tree can deliver useful sub-queries.

4. EVALUATION PLAN

Instead of creating argument collections, which is a very time-consuming task, or automatically mining arguments from natural language texts, which might be noisy, we will adapt the idea of [15] and make use of several debate portals. In fact we use idebate.org, debatewise.org, debatepedia.org, and debate.org as a starting point. While the first three are of high quality, the last is of lower quality, i.e., some few premises consist of insults or nonsense. However, the last contains many more debates than the other three together. We expect this constellation to result in good diversification. The construction of debate portals already provides argument structures. One questioner asks the community about a topic, e.g., "Should we build new nuclear power plants". Then users of the community can directly answer the question and substantiate their posts, e.g. with facts or examples. Many debate portals also provide the possibility of adding a stance for or against to an answer, as do the portals we have selected for our study. The main advantages of debate portals are that the posts are not artificial but close to reality. Besides that, they are coherent. Following [15], we use the debate portals' questions as claims and their answers as premises to build arguments.

We can divide the evaluation of the two-stage retrieval process into two evaluation steps. First we want to find claims similar to a query claim. This can be achieved via an existing textual similarity method. In order to decide which similarity method is suitable, we can take a small number n of query claims and build pools of depth k by a union of the result claims of existing similarity methods. Then, annotators can manually assess the similarity of each (query claim, result claim) pair, e.g. in the range between 1 (nothing in common) and 5 (semantically equal). The question of which similarity method should be adopted for the retrieval of claims can then be shifted to the question of which method's ranking comes closest to the annotations. We will use state-of-the-art ranking measures such as nDCG [8] for the evaluation of rankings.
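For reference, a minimal sketch of nDCG at cutoff k [8], as we intend to use it to compare method rankings against the annotated assessments, is shown below; the graded relevance values are assumed to be given, and the standard log2 discount is used:

import math

def dcg(relevances, k):
    # Discounted cumulative gain over the top-k graded relevance values.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(ranked_relevances, k):
    # nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of the result claims a method returned, in ranked order (scale 1-5).
print(ndcg([4.0, 2.5, 4.0, 1.0, 3.0], k=5))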
After we have determined the most similar claims for a query claim, we want to retrieve their directly tied (clusters of) premises. In order to validate the hypothesis that claims highly similar to the query claim also have premises that are highly relevant for the query claim, we can take a fixed number of (query claim, result claim, result claim premise) triples of different similarities and let annotators manually assess them, e.g. on a binary scale, where the annotators are not aware of the actual result claim. The higher the similarity between two claims is, the more relevant the one's premises should be to the other claim.

Furthermore, we need an end-to-end analysis to evaluate the overall performance of our premise retrieval approach, i.e., how well our approach can retrieve premises for a given claim. For a subset of our query claims, we will build a pool of all result premises in the top-k (for some k ∈ N) of all result lists and let annotators assess the premises' relevance as explained above. In addition, we can conduct a user study with more participants to overcome the possible shortcoming of having only few annotators check the results. By the use of nDCG at different cutoffs, averaged over all queries, we can evaluate different retrieval methods for this end-to-end analysis.

5. PRELIMINARY RESULTS

In this section we give an overview of the results we have found so far by investigating the stages of the two-stage retrieval model. First we describe how we built our dataset of arguments, then we describe the first, and then the second step of the two-step retrieval process.

The dataset described in [15] is not publicly available, therefore we reconstructed a similar dataset following the approach in that paper. We crawled the arguments from four debate portals, namely debate.org, debatepedia.org, debatewise.org, and idebate.org. After the arguments were extracted, they were indexed with Apache Lucene. In the end, this resulted in overall 59,126 claims with 695,818 premises, so on average about 11.8 premises per claim.

We now describe the first step of the two-step retrieval process. Since real-life query inputs of users are difficult to find, we drew a random sample of 233 claims and used them as queries. In order to avoid claims that address completely random topics, our sample contained only claims that are related to the topic "energy". To do so, we trained a word embedding model on the 59,126 claims of our corpus using DeepLearning4j (among others, we used skip-gram as the learning algorithm, the maximum window size was 8, the word vector size was 1,000, the text was not preprocessed, and the number of iterations over the whole corpus was 15). Then, we retrieved the nearest words of the word energy and filtered out inappropriate suggestions. Inappropriate suggestions were those that had nothing in common with our topic energy in the broadest sense. We repeated this approach five times for all newly added suggestions. In the end, we obtained 44 words such as "nuclear", "electricity", "wind", "solar", "oil", "emission", etc. We got 1,529 candidate claims in which at least one of these words occurred, from which we drew a random sample of 233 claims, making sure by manual inspection that they are really related to the topic energy. To ensure that we would end up with at least 200 valid claims, we added another 33. In the end, we removed one claim because it appeared twice. We considered 196 different retrieval methods implemented in Apache Lucene (Apache Lucene, version 7.6.0, provides 139 similarity methods as well as a class for combining multiple similarities; we tested all combinations of the best methods' variants of Divergence from Randomness, Divergence from Independence, information-based models, and axiomatic approaches as well as BM25 and Jelinek-Mercer in a first run and got Σ_{k=2}^{6} C(6, k) = 57 new methods, resulting in 196 methods) and retrieved, for each method, result claims for our 232 query claims. From the results, we built pools of depth 5, i.e., including any claim that appeared in the result list of any method at rank 5 or better. This resulted in 5,171 (query claim, result claim) pairs. Note that pairs where the result claim was equal to the query claim are already excluded.
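As an illustration, the following Python analogue (using gensim instead of DeepLearning4j) roughly mirrors the reported configuration of the topic-term expansion step; the claims file is a placeholder, and the manual filtering and repeated expansion of accepted suggestions are not shown:

from gensim.models import Word2Vec

# Each claim is one token list; "claims.txt" is a placeholder for the corpus
# of 59,126 claims, one claim per line, without further preprocessing.
sentences = [line.split() for line in open("claims.txt", encoding="utf-8")]

# Skip-gram model roughly mirroring the reported configuration:
# window 8, vector size 1,000, 15 epochs over the corpus.
model = Word2Vec(sentences, sg=1, window=8, vector_size=1000,
                 epochs=15, min_count=1, workers=4)

# Expand the seed term "energy"; inappropriate suggestions would be filtered
# manually and the expansion repeated for the newly accepted terms.
candidates = [word for word, _ in model.wv.most_similar("energy", topn=20)]
print(candidates)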
The user-perceived similarity of each (query claim, result claim) pair was independently assessed by at least two annotators on the scale from 1 to 5. A total of eight people participated in the annotation. They are all part of the ReCAP project and were introduced to the basics of argumentation theory. Table 1 explains the meanings of the different levels. The underlying assumption of this scale is that all premises of claims rated 4 or 5 should apply to the query claim, whereas no premises of claims rated 1 should apply. For claims rated 3, we expect that a good number of premises match, whereas premises of claims rated 2 would only rarely match. The annotators were confronted with the query claim and a result claim and were asked to assess how well they expect the premises of the result claim (which were unknown to them) to match the query claim. Since we only wanted to measure the relevance of claims at this point, the actual premises were not considered here but investigated later. Since the polarity of premises is not in the focus of this study, we collapse the levels 4 and 5 into a single level 4 for this study. As every pair of query claim and result claim was assessed by at least two annotators, the final relevance value of a result claim for a query claim was computed as the mean of the corresponding assessments.

Table 1: Relevance levels for claim assessment
score   meaning
5       The claims are equal.
4       The claims differ in polarity, but are otherwise equal.
3       The claims differ in specificity or extent.
2       The claims address the same topic, but are unrelated.
1       The claims are unrelated.

Using the assessed pool of results as a gold standard, we evaluated the performance of the 196 retrieval methods under consideration for the claim retrieval task, using nDCG@k [8] with cutoff values k ∈ {1, 2, 5} as the quality metric. Our results clearly show that the BM25 [11] scoring method used in previous works is usually not a good choice, especially for cutoff 5, which is a realistic cutoff for a system that aims at finding the top-10 premises. In contrast to the method Divergence from Randomness (DFR) [7], which yielded an nDCG@5 of 0.7982, BM25 yielded only 0.7616.

We now focus on the second step of the two-stage retrieval framework, retrieving the premises of claims similar to the query claim. Our goal here is to verify the assumption made above that claims highly similar to the query claim also have premises that are highly relevant for the query claim. To systematically approach this question, we formed triples of the form (query claim, result claim, result premise) from the above-mentioned pool, where the result premise is a premise of the result claim. We grouped the triples according to the relevance of the result claim to the query claim, forming groups for the relevance ranges [n, n + 0.5) with n ∈ {1, 1.5, 2, 2.5, 3, 3.5} and the range [4, 4], which yielded seven groups. Then, we randomly drew 100 (query claim, result claim, result premise) triples from each group and had two annotators manually assess the relevance of the result premise for the query claim (without seeing the result claim), resulting in 1,400 assessments. Annotators could choose between not relevant and relevant with three different stances: query with neutral stance, premise with the same stance as the query, and premise with the opposite stance to the query. As we did with claims before, we ignore the stances of premises since we only want to focus on their relevance, and many claims of our dataset do not have a stance anyway. We thus consider only binary relevance for premises from now on. Our preliminary results support the observation that the more relevant a claim is for the query, the more relevant premises it yields. For example, 80 % of the premises of result claims in the interval [4, 4] were relevant to the query claim. In comparison, only 6 % of the premises in the interval [1, 1.5) were relevant to the query claim. So if a search engine performs well at the claim retrieval task, it should also perform well at the subsequent premise retrieval task; the initial hypothesis is thus validated.
6. RELATED WORK

Wachsmuth et al. [15] introduce one of the first prototypes of an argument search engine, called args. Their system operates on arguments crawled from debate portals. Given a user query, the system retrieves, ranks, and presents premises supporting and attacking the query claim, taking the similarity of the query claim with the premise, its corresponding claim, and other contextual information into account. They apply a standard BM25F ranking model implemented on top of Lucene. In contrast to their system, we did not restrict ourselves to BM25 or variants, but evaluated 196 different similarity methods for claim retrieval.

Stab et al. [12] present ArgumenText, an argument retrieval system capable of retrieving topic-relevant sentential arguments from a large collection of diverse Web texts for any given controversial topic. The system first retrieves relevant documents, then it identifies arguments and classifies them as "pro" or "con", and presents them ranked by relevance in a web interface. In their implementation, they make use of Elasticsearch and BM25 to retrieve the top-ranked documents. In contrast to this work, we do not consider the argument mining task, but assume that we operate on a collection of arguments with claims and premises. In another work, Habernal and Gurevych [4] propose a semi-supervised model for argumentation mining of user-generated Web content.

In [5], Habernal and Gurevych address the relevance of premises to estimate the convincingness of arguments using neural networks. Since relevance is subject to subjective judgement, they first confronted users in a crowdsourcing task with pairs of premises to decide which premise is more convincing, and then used a bidirectional LSTM to predict which argument is more convincing. Wachsmuth et al. [16] consider the problem of judging the relevance of arguments and provide an overview of the work on computational argumentation quality in natural language, including theories and approaches. Approaches that predict the relevance or convincingness of premises can be useful to rank premises.

7. CONCLUSION AND FUTURE WORK

Retrieving good premises for claims is an important but difficult problem for which no good solutions exist yet. This paper has provided some insights that a two-stage retrieval process that first retrieves claims and then ranks their clustered premises can be a step towards a solution. That the best premises are found for the most similar claims, according to assessments by human annotators, is already a good result. We showed that, instead of exhaustively assessing all retrieved premises for a claim, it is sufficient to assess only the retrieved claims, which is an order of magnitude less work.

Our future work will include ranking methods for premises. We will also examine additional quality-based premise features [16] such as convincingness or correctness. We plan a public Web application as an interface to our premise retrieval system.

We will also tackle the task of detecting stances. Although debate portals ask users to add stances to the premises, these stances relate to the claim, but the claims' stances are not further specified. Hence, premises that support a claim may attack a claim with an opposite stance and vice versa.

8. ACKNOWLEDGMENTS

I would like to thank my supervisor Ralf Schenkel for his invaluable help in creating this paper.

This work has been funded by the Deutsche Forschungsgemeinschaft (DFG) within the project ReCAP, Grant Number 375342983 - 2018-2020, as part of the Priority Program "Robust Argumentation Machines (RATIO)" (SPP-1999).

9. REFERENCES

[1] R. Bergmann, R. Schenkel, L. Dumani, and S. Ollinger. ReCAP - information retrieval and case-based reasoning for robust deliberation and synthesis of arguments in the political discourse. In Proceedings of the Conference "Lernen, Wissen, Daten, Analysen", LWDA 2018, Mannheim, Germany, August 22-24, 2018, pages 49-60, 2018.
[2] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, and R. Kurzweil. Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, pages 169-174, 2018.
[3] J. Eckle-Kohler, R. Kluge, and I. Gurevych. On the role of discourse markers for discriminating claims and premises in argumentative discourse. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 2236-2242, 2015.
[4] I. Habernal and I. Gurevych. Exploiting debate portals for semi-supervised argumentation mining in user-generated web discourse. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 2127-2137, 2015.
[5] I. Habernal and I. Gurevych. Which argument is more convincing? Analyzing and predicting convincingness of web arguments using bidirectional LSTM. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016.
[6] I. Habernal, R. Hannemann, C. Pollak, C. Klamm, P. Pauli, and I. Gurevych. Argotario: Computational argumentation meets serious games. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017 - System Demonstrations, pages 7-12, 2017.
[7] S. P. Harter. A probabilistic approach to automatic keyword indexing. JASIS, 26(4):197-206, 1975.
[8] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422-446, 2002.
[9] G. N. Lance and W. T. Williams. Mixed-data classificatory programs I - agglomerative systems. Australian Computer Journal, 1(1):15-20, 1967.
[10] A. Peldszus and M. Stede. From argument diagrams to argumentation mining in texts: A survey. IJCINI, 7(1):1-31, 2013.
[11] S. E. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333-389, 2009.
[12] C. Stab, J. Daxenberger, C. Stahlhut, T. Miller, B. Schiller, C. Tauchmann, S. Eger, and I. Gurevych. ArgumenText: Searching for arguments in heterogeneous sources. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 2-4, 2018, Demonstrations, pages 21-25, 2018.
[13] C. Stab, C. Kirschner, J. Eckle-Kohler, and I. Gurevych. Argumentation mining in persuasive essays and scientific articles from the discourse structure perspective. In Proceedings of the Workshop on Frontiers and Connections between Argumentation Theory and Natural Language Processing, Forlì-Cesena, Italy, July 21-25, 2014, 2014.
[14] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2003, Edmonton, Canada, May 27 - June 1, 2003, 2003.
[15] H. Wachsmuth, M. Potthast, K. A. Khatib, Y. Ajjour, J. Puschmann, J. Qu, J. Dorsch, V. Morari, J. Bevendorff, and B. Stein. Building an argument search engine for the web. In Proceedings of the 4th Workshop on Argument Mining, ArgMining@EMNLP 2017, Copenhagen, Denmark, September 8, 2017, pages 49-59, 2017.
[16] H. Wachsmuth, B. Stein, G. Hirst, V. Prabhakaran, Y. Bilu, Y. Hou, N. Naderi, and T. Alberdingk Thijm. Computational argumentation quality assessment in natural language. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, pages 176-187, 2017.
[17] D. Walton, C. Reed, and F. Macagno. Argumentation Schemes. Cambridge University Press, 2008.