Creating an Argument Search Engine for Online Debates
Notebook for the Touché Lab on Argument Retrieval at CLEF 2020

Maximilian Bundesmann¹, Lukas Christ², and Matthias Richter³
¹ University of Leipzig, Germany, mb74fawu@studserv.uni-leipzig.de
² University of Leipzig, Germany, lc85futa@studserv.uni-leipzig.de
³ University of Leipzig, Germany, mr75syri@studserv.uni-leipzig.de

Abstract. Consulting web search engines has become an everyday procedure for many internet users. One specific task that has gained attention in recent work is the retrieval of arguments for controversial topics. Most of the preexisting difficulties that search engines have to face also apply to this task. However, certain challenges become even more important, such as providing appropriate heterogeneity in the result set. We present an argument search engine for the argsme corpus. Our focus is on preprocessing the corpus while also addressing the heterogeneity problem and implementing a query expansion feature. Furthermore, we provide a brief evaluation of our retrieval results.

Keywords: Information Retrieval · Argumentative Conversations · Online Debates · Argument Search

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

Nowadays, search engines are used by everybody. Consulting a search engine is the easiest way to find desired information such as today's weather, news articles or an arbitrary image. However, there are still some problems for which modern search engines fail to deliver satisfying answers. One of these challenges is the search for arguments in large document collections, e.g. for debates such as "Are plastic bottles good?" or "Are speed limits wrong?". Search engines that find arguments for these kinds of queries can be classified as argument search engines. Some solutions are already available, e.g. Args.me [22] or ArgumenText [20].

This report is created in the context of the Touché shared task on argument retrieval [5]. Its goal is to develop an argument search engine that retrieves arguments from the argsme corpus [2], which provides almost 390,000 arguments from over 55,000 online debates. We develop a search engine to find good arguments in this corpus; this report describes our approach and evaluates its performance. First, Section 2 summarizes the state of the art of information retrieval in the context of argument search engines. Section 3 introduces our search engine's architecture: we give a short overview of the corpus, describe the necessary preprocessing steps and present the ideas behind the separate components. In Section 4 we present the results of the final evaluation.

Argument search refers to collecting relevant premises and conclusions for a given topic that is usually of a controversial nature. The goal of such a search engine is to provide the user with supported statements that help him to gather knowledge about his topic of interest and potentially assist his decision making.

2 Related Work

Previous work has tackled various tasks of argument search, including automatically detecting evidence that supports a given claim [19], determining argument relevance [17] and acquiring a corpus of arguments [1]. The latter, along with the work presented by Wachsmuth et al. [22], constitutes the basis for this work.
2.1 Retrieval Models

The heart of an argument search engine is a proper retrieval model. The challenge is to find the best arguments with respect to a corpus and a free-text query. Existing argument search engines such as args.me [22] and ArgumenText [20] are based on the Okapi BM25 retrieval model. Potthast et al. [17] performed a user study to evaluate several well-known retrieval models, namely Lucene's BM25 and Terrier's implementations of DPH [3], DirichletLM [25], and TF-IDF. The arguments retrieved by the different models were rated by their relevance, rhetoric, logic, and dialectic quality. DPH proved to yield the best results overall.

2.2 Preprocessing

Predicting argument quality is a challenging task. Wachsmuth et al. [21] discuss the concept of argument quality. Wei et al. [24] rank argumentative Reddit posts in order to find the most persuasive ones; their approach is machine-learning-based, as is the approach proposed by Persing and Ng [16]. Potthast et al. [17] provide a subset of the argsme corpus in which arguments were manually annotated with ratings for the quality aspects defined by Wachsmuth et al. [21]. Furthermore, an overall quality rating was assigned to each argument.

2.3 Query Expansion

Several approaches have been explored that aim to improve the recall for user queries [8]. Examples are pseudo-relevance feedback (using terms from the top-ranked documents) and interactive query refinement, which requires the user to readjust his query. Another option is to use search query logs to obtain the rewritings users perform to better match their query terms to their information need. However, some of these methods are out of scope for this work or require additional data; search query logs, for instance, are unavailable for this task. Therefore, we focus on automatic query expansion (AQE) methods based on word embeddings. Diaz et al. [12] and Zuccon et al. [26], for instance, utilized such "model-based" approaches.

2.4 Clustering

The aim of clustering here is to guarantee a diverse result set in order to present the user with a variety of different arguments. Carbonell and Goldstein [7] introduce Maximal Marginal Relevance (MMR), a measure that allows building a ranking incrementally while balancing the quality and heterogeneity of the result set. Another approach to building a diverse ranking incrementally is proposed by Kaptein et al. [15]. Deselaers et al. [11] describe a method to diversify image search results using a "novelty" measure. A problem in diversifying search results is that classic evaluation measures like (n)DCG do not take the diversity of results into account. Clarke et al. [9] thus propose an alternative evaluation framework.

3 Methods

Our argument search engine's architecture is depicted in Figure 1. The central module is the Apache Lucene Core search library (https://lucene.apache.org/), which realizes indexing and retrieval. Before indexing, the corpus is preprocessed; during preprocessing, quality ratings for all documents are computed. At run time, the user's queries are enriched by a query expansion module. Then, after retrieving a set of relevant documents via the Lucene Core, results are ranked considering both the scores obtained by the retrieval system and the quality ratings. Eventually, the last component can perform clustering on the top-ranked results.

Fig. 1. All components of the introduced pipeline can be changed independently.
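As a rough illustration of this architecture, the sketch below wires the stages together in Python. All names (Result, search, expand, retrieve, rerank_top) are illustrative placeholders of our own, not the actual implementation, whose indexing and retrieval core runs on the Java-based Lucene library.

```python
# Illustrative orchestration of the pipeline in Figure 1 (placeholder names, not actual code).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Result:
    doc_id: str
    retrieval_score: float  # normalized score from the retrieval model (BM25 or DPH)
    quality: float          # precomputed quality rating q(d) in [0, 1]


def search(query: str,
           expand: Callable[[str], str],
           retrieve: Callable[[str], List[Result]],
           alpha: float = 0.5,
           rerank_top: Callable[[List[Result]], List[Result]] = lambda results: results
           ) -> List[Result]:
    """Expand the query, retrieve candidates, rank them by a weighted combination of
    retrieval score and quality rating (see Section 3.2), then optionally diversify
    the top-ranked results (see Section 3.4)."""
    candidates = retrieve(expand(query))
    ranked = sorted(candidates,
                    key=lambda r: alpha * r.retrieval_score + (1 - alpha) * r.quality,
                    reverse=True)
    return rerank_top(ranked[:8]) + ranked[8:]
```

Keeping query expansion, retrieval and reranking behind plain function interfaces is what allows each component in Figure 1 to be exchanged independently.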
3.1 Preprocessing

The quality of the arguments contained in the corpus is heterogeneous. Some documents do not contain arguments at all. We therefore aim to assign ratings to the documents indicating their argumentative quality. More formally, we create a mapping

q : D → [0, 1]    (1)

where D is the corpus as a set of documents. If q(d1) > q(d2), the argumentative quality of d1 is considered higher than that of d2. These ratings are then used in our retrieval model, as the user should only receive arguments of high quality. To compute them, we employ a machine learning approach.

Argument quality is a rather elusive concept that cannot be quantified directly. Wachsmuth et al. [21] break it down into three aspects:

– Logical quality: are the premises acceptable and do they really imply the conclusion?
– Rhetorical quality: is the argument formulated in a persuasive manner?
– Dialectical quality: does the argument contribute to resolving the issue?

Capturing the logical dimension of argument quality with computational features is a hard task; solving it is beyond the scope of this project. The dialectical quality dimension is not available in our corpus either: unlike the corpus of Reddit posts used by Wei et al. [24], the argsme corpus contains no information about replies to a post or citations of a post. Thus, the only quality dimension we aim to quantify is rhetorical quality. To achieve this, we compute 22 features for each argument, most of which can already be found in Wei et al. [24] and Persing and Ng [16]. In the following, the features are briefly described; a small code sketch follows the list.

Linguistic competence features aim at quantifying the author's linguistic skills: average sentence length, average word length, type/token ratio, number of punctuation marks per sentence, counts of different POS tags (whole text), conjunctives per sentence, modal verbs per sentence, emojis (the use of emojis might coincide with rather colloquial language and a lack of seriousness), and the ratio of non-stopwords.

Sources and examples: claims are more persuasive when they are supported by examples and sources. The following rule-based features are intended to capture them: number of references per sentence, examples per sentence, URLs per sentence, percentages per sentence, and year specifications per sentence.

Subjectivity, ad hominem and emotionality: arguments are more persuasive when they are presented in an objective manner, without anecdotal evidence or attacking the opponent personally. We aim to quantify subjectivity and emotionality with the following features:

– Number of first person plural pronouns per sentence, as an indicator of subjectivity. We do not count first person singular words, since Persing and Ng [16] argue that objective arguments frequently start with phrases like "I think..." or "I believe...", too.
– Number of second person pronouns per sentence, which may indicate personal attacks.
– Sentiment analysis, which is able to indicate high emotionality. We use VADER (Hutto et al. [14]).
– Hedge words/phrases per sentence: these phrases may indicate a more polite, indirect and differentiated formulation. We use a publicly available list (https://github.com/words/hedges) to identify such phrases.
– Number of definite articles divided by the number of articles: Persing and Ng [16] argue that a lack of definite articles often means a lack of specificity and objectivity.
– Average concreteness: Brysbaert et al. [6] provide ratings for word concreteness obtained by crowdsourcing. This feature describes the average degree of abstractness/concreteness in the argument.
– Components of emotions: words affect our emotions. According to Warriner et al. [23], there are three components of each emotion:
  • valence, i.e. "pleasantness" [23], ranges from "happy" to "unhappy"
  • arousal is "the intensity of emotion provoked by a stimulus" [6]
  • dominance denotes "the degree of control exerted by a stimulus" [23]
For each of these emotion components, Warriner et al. provide word ratings. We build three features: the average valence of the words in the argument, average arousal and average dominance.
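As a concrete illustration, the following minimal Python sketch computes a handful of the features listed above. It assumes the vaderSentiment package for the sentiment score; the tokenization is deliberately simplified compared to our actual preprocessing, and the full feature set is considerably larger.

```python
# Sketch of a few of the 22 rhetorical-quality features (simplified, regex-based tokenization).
import re

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer  # VADER, Hutto et al. [14]

_analyzer = SentimentIntensityAnalyzer()


def rhetorical_features(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()] or [text]
    tokens = re.findall(r"[A-Za-z']+", text.lower()) or ["<empty>"]
    n_sent = len(sentences)
    return {
        "avg_sentence_length": len(tokens) / n_sent,
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "type_token_ratio": len(set(tokens)) / len(tokens),
        # second person pronouns per sentence: possible personal attacks
        "second_person_per_sentence": len(re.findall(r"\byou(?:rs?)?\b", text.lower())) / n_sent,
        # URLs per sentence: sources supporting a claim
        "urls_per_sentence": len(re.findall(r"https?://\S+", text)) / n_sent,
        # absolute VADER compound score as a rough emotionality signal
        "emotionality": abs(_analyzer.polarity_scores(text)["compound"]),
    }
```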
Before normalizing all features, we filter out odd documents based on rules. For example, the average word length in an argument is expected to lie between 2 and 16 characters. Odd documents are assigned the rating 0.0; such documents are typically spam or short meta-posts like "I accept", "Vote Pro" etc.

As training data we use the Webis-ArgQuality-20 corpus [13]. It contains about 1,600 arguments from the argsme corpus and provides ratings for all three argument quality dimensions as well as for combined/overall argument quality. These continuous ratings range from -4.0 (not an argument) to 4.0. We train several machine learning models: Linear Regression, Decision Tree Regression and Support Vector Regression (SVR) with different kernels. For each type of model we train one instance on rhetorical quality and another instance on combined quality. Both instances' parameters are optimized via grid search.

All models perform rather poorly, confirming that argument quality prediction is a difficult problem. SVR with a quadratic kernel achieves the best results (MSE of 1.641 for rhetorical and 1.475 for combined quality). Moreover, we train an ensemble model (Linear Regression) using the predictions of all models as features. As expected, it outperforms all single models (MSE of 1.468 for rhetorical and 1.322 for combined quality).

An interesting detail is that all models, even those trained on rhetorical quality, perform better at predicting combined argument quality than at predicting rhetorical quality. In other words, overall quality seems to be easier to grasp than rhetorical quality, at least with our approach. The difference is statistically significant at p < 0.01. One explanation may be that some of our features also capture aspects of dialectical and logical quality: for example, providing sources to support a claim could indicate logical correctness. Features related to subjectivity and emotionality might at least be able to suggest low dialectical quality, as a very emotional and/or subjective post is unlikely to contribute to resolving an issue.

Finally, to obtain the desired quality function q : D → [0, 1], we let the trained models predict the combined quality of every argument in the argsme corpus. We compute the predictions of the best single model (quadratic SVR trained on combined quality) and of the ensemble model, leading to two candidates q_svr and q_ens for q. Figure 2 shows the distributions of the ratings generated by both models.

Fig. 2. Distributions of the ratings produced by the ensemble classifier (q_ens) and the SVR classifier (q_svr).

We choose q_svr for q, even though the SVR model's MSE is higher than that of the ensemble method. The main reason for this decision is that there are almost no "bad" arguments according to q_ens, which is certainly inaccurate.
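A minimal sketch of this training step, assuming scikit-learn; X_train, y_combined and X_corpus are placeholder names for the feature matrices and ratings described above, and the min-max rescaling of the raw predictions into [0, 1] at the end is an assumption rather than a documented detail of our system.

```python
# Sketch: quadratic-kernel SVR with grid search, then quality ratings q(d) in [0, 1].
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR


def fit_quality_model(X_train: np.ndarray, y_combined: np.ndarray) -> GridSearchCV:
    """Fit an SVR with a quadratic (degree-2 polynomial) kernel on the
    Webis-ArgQuality-20 combined-quality ratings, which range from -4.0 to 4.0."""
    grid = GridSearchCV(
        SVR(kernel="poly", degree=2),
        param_grid={"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1, 0.5]},
        scoring="neg_mean_squared_error",
        cv=5,
    )
    grid.fit(X_train, y_combined)
    return grid


def quality_ratings(model: GridSearchCV, X_corpus: np.ndarray) -> np.ndarray:
    """Predict combined quality for every corpus document and rescale to [0, 1]
    (min-max scaling is one plausible choice for obtaining q : D -> [0, 1])."""
    predictions = model.predict(X_corpus).reshape(-1, 1)
    return MinMaxScaler().fit_transform(predictions).ravel()
```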
3.2 Retrieval Model

Like the implementation of args.me, we decide to use Apache Lucene for the indexing and retrieval tasks. We index the extended corpus which is generated during preprocessing. In the first step of query processing, stopwords are removed from the query. Before results are retrieved, the query is extended using additional query expansion methods.

For the ranking, we implement different methods. As a baseline, we use Lucene's BM25 implementation. Furthermore, we extend the Lucene search core with an implementation of the DPH concept [4]. These retrieval methods do not consider the quality ratings q of our extended corpus. To benefit from q, we perform a reranking. The scoring function for a document d is given by

score(d) = α · s_0(d) + (1 − α) · q(d)    (2)

where s_0 is the normalized score of the retrieval model. A reasonable value of α ∈ [0, 1] can be determined empirically; initially, we set α = 0.5.

3.3 Query Expansion

A query is a short representation of the user's information need. However, these few words may not be sufficient to encompass the entire concept that the user wants to express. This can lead to highly relevant documents not being found by the retrieval system due to vocabulary mismatch: the user may choose terms for his query that do not appear in a relevant document. To mitigate this gap, automatic query expansion methods can be used. In this section we briefly describe the components of our AQE implementation.

For our query expansion component we decided to use one simple baseline approach and two more sophisticated concepts. As a baseline, we employ WordNet to fetch semantically similar words for each individual query term. This method cannot grasp the concept of the entire query as one unit. However, as many of the queries provided for the shared task consist of only a few terms, such as "speed limit" or "nuclear weapons", this simple AQE method can still provide a useful enhancement.

The other two expansion procedures both rely on word embeddings. We use fastText (https://fasttext.cc/) to obtain vector representations from the argsme corpus and combine these locally trained representations with pre-trained embeddings offered by fastText. We then adapt a query expansion method proposed by Diaz et al. [12]. This model-based expansion procedure searches the word embeddings for semantically similar terms in order to estimate an alternative to the original query, interpolating the query language model p_q with the expansion language model p_q+ as follows:

p_q^1(w) = λ p_q(w) + (1 − λ) p_q+(w)    (3)

All newly found terms are then weighted and the best ones (matching the modeled language) are selected to augment the query.

Even though the work presented by Zuccon et al. [26] does not directly focus on AQE, we also use their insights to realize another expansion method. They investigate different ways to estimate translation probabilities for terms that belong to the same language model. Similarly, our goal for AQE is to find words w that are likely to be "translations" of the initial query terms:

p_t(w|q) = Σ_{u∈q} p_t(w|u) p(u|q)    (4)

where p_t(w|u) describes the probability of translating term u into w, which can be approximated by a normalized cosine similarity. Naturally, both expansion techniques operate on each query as a whole in order to incorporate the relatedness of its terms. Eventually, the expansion terms and their respective weights are returned to our search core. Note that in the current implementation only one expansion method is used at a time.
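The translation-based expansion of Equation 4 can be sketched as follows, under simplifying assumptions: a uniform query language model p(u|q) = 1/|q|, cosine similarities clipped at zero as "translation" probabilities, and the fastText vectors already loaded into a plain dictionary (loading not shown).

```python
# Sketch of Equation 4: score candidate terms by their average (clipped) cosine
# similarity to the query terms; linear scan over the vocabulary for simplicity.
from typing import Dict, List, Tuple

import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


def expand_query(query_terms: List[str],
                 embeddings: Dict[str, np.ndarray],
                 k: int = 5) -> List[Tuple[str, float]]:
    """Return the k candidate terms w with the highest approximate p_t(w|q)."""
    in_vocab = [t for t in query_terms if t in embeddings]
    if not in_vocab:
        return []
    scores = {}
    for word, vec in embeddings.items():
        if word in in_vocab:
            continue
        sims = [max(cosine(vec, embeddings[u]), 0.0) for u in in_vocab]
        scores[word] = sum(sims) / len(in_vocab)  # uniform p(u|q) = 1/|q|
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```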
3.4 Clustering and Reranking

A more experimental component of our search engine is the clustering/reranking module. When retrieving arguments, not only the argumentative quality of the returned results is important; another aspect of an argument search engine's utility is the heterogeneity of the returned arguments. In every use case, the user benefits from receiving a wide variety of semantically different arguments.

A problem of the argsme corpus is that an argumentative document usually contains more than one argument; nevertheless, documents may often be semantically similar. Moreover, optimizing heterogeneity can conflict with optimizing quality, so both goals need to be balanced. As the Touché task is evaluated using nDCG, we first make sure that our results are of high quality (with respect to the query and the argumentation quality). Then, the top 8 results are clustered and reranked in order to diversify the top results.

Semantic clustering is implemented using Latent Semantic Analysis (Deerwester et al. [10]) with 3 topics. This provides a vector of size 3 for each of the top 8 documents. Now, the distance dist(d1, d2) = 1 − SIM(d1, d2), i.e. the dissimilarity between two documents d1 and d2, can be described in terms of the 3-dimensional vectors generated by LSA.

In the following, let R be the ranking and R[i] the document with rank i in R. Similar to Deselaers et al. [11], we employ a notion of a document's novelty. The novelty of a document R[i] is related to R[i]'s predecessors in the ranking, R[1]...R[i − 1]:

Nov(R[i]) := Σ_{k=1}^{i−1} (1/k) · dist(R[i], R[i − k])    (5)

We weight the dissimilarity depending on the number of ranks k between R[i] and R[i − k]: above all, documents should not be similar to their immediate predecessor. Based on the novelties, we define a measure for R's diversity/heterogeneity:

heterogeneity(R) := Σ_{j=2}^{|R|} (1/j) · Nov(R[j])    (6)

The likelihood that a user actually looks at a document d decreases with d's rank. Because of that, our heterogeneity measure weights each document's novelty depending on its rank. Next, the heterogeneity of a ranking R needs to be balanced with R's quality. To achieve this, we compute a reranking R′ of R that maximizes

γ · quality(R′) + (1 − γ) · heterogeneity(R′)    (7)

We use nDCG with our retrieval model's ratings to compute quality(R′). Note that the problem of finding an optimal R′ can be framed as a Mixed Integer Program. However, since we restrict ourselves to reranking only the top 8 documents, we find an optimal solution using brute force. Table 1 shows the effect of our reranking on a dummy corpus.

Table 1. Examples for reranking on a dummy subset of the argsme corpus, for different values of γ in Equation 7. In brackets: the dummy quality value for each "argument".

Initial ranking (γ = 1)            | γ = 0.5                   | γ = 0
Vote Con! (1.0)                    | Vote Con! (1.0)           | Vote PRO. (0.625)
Vote Con (0.875)                   | Vote for Con. (0.75)      | Please extend ... (0.25)
Vote for Con. (0.75)               | Extend my arg... (0.375)  | vote for pro (0.5)
Vote PRO. (0.625)                  | Vote Con (0.875)          | extend all arg... (0.125)
vote for pro (0.5)                 | Please extend ... (0.25)  | Vote Con (0.875)
Extend my arguments. (0.375)       | vote for pro (0.5)        | Vote for Con. (0.75)
Please extend all arguments (0.25) | Vote PRO. (0.625)         | Extend my arg... (0.375)
extend all arguments (0.125)       | extend all arg... (0.125) | Vote Con! (1.0)

The hyperparameter γ ∈ [0, 1] in Equation 7 could be set by the user. Alternatively, γ could be investigated further in order to find a reasonable value; this is beyond the scope of our project.
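The brute-force reranking can be sketched as follows; the nDCG-style quality term and the document distances are simplified stand-ins (in our system the ratings come from the retrieval model and dist is the LSA-based dissimilarity).

```python
# Brute-force reranking of a small candidate set according to Equations 5-7.
import itertools
import math
from typing import Callable, List, Sequence


def novelty(ranking: Sequence[int], i: int, dist: Callable[[int, int], float]) -> float:
    # Equation 5: dissimilarity to all predecessors, weighted by rank distance k
    return sum(dist(ranking[i], ranking[i - k]) / k for k in range(1, i + 1))


def heterogeneity(ranking: Sequence[int], dist: Callable[[int, int], float]) -> float:
    # Equation 6: novelties weighted by (1-based) rank
    return sum(novelty(ranking, j, dist) / (j + 1) for j in range(1, len(ranking)))


def quality(ranking: Sequence[int], ratings: Sequence[float]) -> float:
    # nDCG-style quality: DCG over the documents' ratings, normalized by the ideal ordering
    dcg = sum(ratings[doc] / math.log2(pos + 2) for pos, doc in enumerate(ranking))
    ideal = sum(r / math.log2(pos + 2) for pos, r in enumerate(sorted(ratings, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0


def rerank(ratings: Sequence[float], dist: Callable[[int, int], float], gamma: float) -> List[int]:
    """Search all permutations of the (small) candidate set for the ranking that
    maximizes gamma * quality + (1 - gamma) * heterogeneity (Equation 7)."""
    docs = range(len(ratings))
    return list(max(itertools.permutations(docs),
                    key=lambda r: gamma * quality(r, ratings)
                    + (1 - gamma) * heterogeneity(r, dist)))
```

With γ = 1 this reproduces the quality-sorted initial ranking in Table 1; smaller values of γ trade quality for diversity.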
For the evaluation, we turn off the clustering component (i.e. we set γ to 1), because nDCG does not consider heterogeneity.

4 Evaluation and Results

For the final evaluation of our system, we decide to use the combination of DPH and the baseline query expansion. Additionally, we augment the scoring function with our quality ratings as described in Equation 2, using α = 0.5. The clustering component is not used, since it cannot be expected to have a positive impact on nDCG scores, as pointed out in Section 3.4. Among the various retrieval models, DPH should show the best performance according to the findings of Potthast et al. [17]. Moreover, some quick experiments with manually labelled test data have shown that among our query expansion methods, the baseline expansion achieves the most satisfying results.

The final run, which was evaluated via tira.io [18], reaches a sound nDCG@5 of 0.804. For the older version of the corpus, this run was the best performing among all participants (under the automatically assigned team name Weiss Schnee), indicating that the employed combination could be suitable for argument search tasks. We did not submit a run for the more recent corpus version.

5 Conclusion and Outlook

We implemented an argument search engine for the argsme corpus. The results, however, are not very convincing yet. We suppose they could be improved in future work, considering the following aspects. As the search engine proved to benefit from our argument quality ratings, these ratings could be investigated further: more sophisticated features and models could be tested, and the weighting of the ratings in our retrieval model, i.e. the hyperparameter α in Equation 2, could be optimized. One of the major downsides of our approach is that it does not analyse the semantics of potentially relevant documents; thus, the precision is often rather low. Future work could tackle this issue. A closer investigation of the query expansion component (e.g. examining more queries) would probably improve our search engine's results, too. We implemented a reranking component to diversify the top-ranked results, but we were not able to evaluate its quality within the scope of this work. Moreover, the reranking component is only implemented in a proof-of-concept style, limited to the top 8 documents. To conclude, we aimed to address the complex problem of argument retrieval using several different methods. There is much space for extending and enhancing our approach in order to improve its performance.

References

1. Ajjour, Y., Wachsmuth, H., Kiesel, J., Potthast, M., Hagen, M., Stein, B.: Data acquisition for argument search: The args.me corpus. In: Joint German/Austrian Conference on Artificial Intelligence (Künstliche Intelligenz). pp. 48–59. Springer (2019)
2. Ajjour, Y., Wachsmuth, H., Kiesel, J., Potthast, M., Hagen, M., Stein, B.: Data acquisition for argument search: The args.me corpus. In: Benzmüller, C., Stuckenschmidt, H. (eds.) 42nd German Conference on Artificial Intelligence (KI 2019). pp. 48–59. Springer (Sep 2019). https://doi.org/10.1007/978-3-030-30179-8_4
3. Amati, G.: Frequentist and Bayesian approach to information retrieval. In: European Conference on Information Retrieval. pp. 13–24. Springer (2006)
4. Amati, G.: Frequentist and Bayesian approach to information retrieval. In: European Conference on Information Retrieval. pp. 13–24. Springer (2006)
5.
Bondarenko, A., Fröbe, M., Beloucif, M., Gienapp, L., Ajjour, Y., Panchenko, A., Biemann, C., Stein, B., Wachsmuth, H., Potthast, M., Hagen, M.: Overview of Touché 2020: Argument Retrieval. In: Working Notes Papers of the CLEF 2020 Evaluation Labs (Sep 2020)
6. Brysbaert, M., Warriner, A.B., Kuperman, V.: Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods 46(3), 904–911 (2014)
7. Carbonell, J., Goldstein, J.: The use of MMR, diversity-based reranking for reordering documents and producing summaries. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 335–336 (1998)
8. Carpineto, C., Romano, G.: A survey of automatic query expansion in information retrieval. ACM Computing Surveys (CSUR) 44(1), 1–50 (2012)
9. Clarke, C.L., Kolla, M., Cormack, G.V., Vechtomova, O., Ashkan, A., Büttcher, S., MacKinnon, I.: Novelty and diversity in information retrieval evaluation. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 659–666 (2008)
10. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), 391–407 (1990)
11. Deselaers, T., Gass, T., Dreuw, P., Ney, H.: Jointly optimising relevance and diversity in image retrieval. In: Proceedings of the ACM International Conference on Image and Video Retrieval. pp. 1–8 (2009)
12. Diaz, F., Mitra, B., Craswell, N.: Query expansion with locally-trained word embeddings. arXiv preprint arXiv:1605.07891 (2016)
13. Gienapp, L., Stein, B., Hagen, M., Potthast, M.: Efficient pairwise annotation of argument quality. In: 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). pp. 5772–5781. Association for Computational Linguistics, Online (Jul 2020), https://www.aclweb.org/anthology/2020.acl-main.511
14. Hutto, C.J., Gilbert, E.: VADER: A parsimonious rule-based model for sentiment analysis of social media text. In: Eighth International AAAI Conference on Weblogs and Social Media (2014)
15. Kaptein, R., Koolen, M., Kamps, J.: Result diversity and entity ranking experiments: Anchors, links, text and Wikipedia. Tech. rep., Intelligent Systems Lab Amsterdam, University of Amsterdam (2009)
16. Persing, I., Ng, V.: Why can't you convince me? Modeling weaknesses in unpersuasive arguments. In: IJCAI. pp. 4082–4088 (2017)
17. Potthast, M., Gienapp, L., Euchner, F., Heilenkötter, N., Weidmann, N., Wachsmuth, H., Stein, B., Hagen, M.: Argument search: Assessing argument relevance. In: 42nd International ACM Conference on Research and Development in Information Retrieval (SIGIR 2019). ACM (Jul 2019). https://doi.org/10.1145/3331184.3331327
18. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA integrated research architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World. The Information Retrieval Series, Springer (Sep 2019). https://doi.org/10.1007/978-3-030-22948-1_5
19. Rinott, R., Dankin, L., Alzate Perez, C., Khapra, M.M., Aharoni, E., Slonim, N.: Show me your evidence - an automatic method for context dependent evidence detection. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 440–450. Association for Computational Linguistics, Lisbon, Portugal (Sep 2015).
https://doi.org/10.18653/v1/D15-1050, https://www.aclweb.org/anthology/D15-1050
20. Stab, C., Daxenberger, J., Stahlhut, C., Miller, T., Schiller, B., Tauchmann, C., Eger, S., Gurevych, I.: ArgumenText: Searching for arguments in heterogeneous sources. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. pp. 21–25 (2018)
21. Wachsmuth, H., Naderi, N., Hou, Y., Bilu, Y., Prabhakaran, V., Thijm, T.A., Hirst, G., Stein, B.: Computational argumentation quality assessment in natural language. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. pp. 176–187 (2017)
22. Wachsmuth, H., Potthast, M., Al-Khatib, K., Ajjour, Y., Puschmann, J., Qu, J., Dorsch, J., Morari, V., Bevendorff, J., Stein, B.: Building an argument search engine for the web. pp. 49–59 (Sep 2017). https://doi.org/10.18653/v1/W17-5106, https://www.aclweb.org/anthology/W17-5106
23. Warriner, A.B., Kuperman, V., Brysbaert, M.: Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods 45(4), 1191–1207 (2013)
24. Wei, Z., Liu, Y., Li, Y.: Is this post persuasive? Ranking argumentative comments in online forum. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 195–200 (2016)
25. Zhai, C., Lafferty, J.: A study of smoothing methods for language models applied to ad hoc information retrieval. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, New York, NY, USA (2001). https://doi.org/10.1145/383952.384019
26. Zuccon, G., Koopman, B., Bruza, P., Azzopardi, L.: Integrating and evaluating neural word embeddings in information retrieval. In: Proceedings of the 20th Australasian Document Computing Symposium. pp. 1–8 (2015)