=Paper=
{{Paper
|id=Vol-2936/paper-205
|storemode=property
|title=Overview of Touché 2021: Argument Retrieval
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-205.pdf
|volume=Vol-2936
|authors=Alexander Bondarenko,Lukas Gienapp,Maik Fröbe,Meriem Beloucif,Yamen Ajjour,Alexander Panchenko,Chris Biemann,Benno Stein,Henning Wachsmuth,Martin Potthast,Matthias Hagen
|dblpUrl=https://dblp.org/rec/conf/clef/BondarenkoGFBAP21
}}
==Overview of Touché 2021: Argument Retrieval==
Overview of Touché 2021: Argument Retrieval
Extended Version*

Alexander Bondarenko1, Lukas Gienapp2, Maik Fröbe1, Meriem Beloucif3, Yamen Ajjour1, Alexander Panchenko4, Chris Biemann3, Benno Stein5, Henning Wachsmuth6, Martin Potthast2, and Matthias Hagen1

1 Martin-Luther-Universität Halle-Wittenberg, 2 Leipzig University, 3 Universität Hamburg, 4 Skolkovo Institute of Science and Technology, 5 Bauhaus-Universität Weimar, 6 Paderborn University

touche@webis.de
https://touche.webis.de

*This overview extends that published as part of the CLEF 2021 proceedings [1].

Abstract
This paper is a report on the second year of the Touché shared task on argument retrieval held at CLEF 2021. With the goal of providing a collaborative platform for researchers, we organized two tasks: (1) supporting individuals in finding arguments on controversial topics of social importance and (2) supporting individuals with arguments in personal everyday comparison situations. Unlike in the first year, several of the 27 teams participating in the second year managed to submit approaches that improved upon argumentation-agnostic baselines for the two tasks. Most of the teams made use of last year's Touché data for parameter optimization and fine-tuning their best configurations.

Keywords
Argument retrieval, Controversial questions, Comparative questions, Shared task

1. Introduction

Informed decision making and opinion formation are natural routine tasks. Both often involve weighing two or more options. Any choice to be made may be based on personal prior knowledge and experience, but it may also require searching for and processing new knowledge. With the ubiquitous access to various kinds of information on the web—from facts through opinions and anecdotes to arguments—everybody has the chance to acquire knowledge for decision making or opinion formation on almost any topic. However, large amounts of easily accessible information imply challenges such as the need to assess their relevance to the specific topic of interest and to estimate how well an implied stance is justified; no matter whether it is about topics of social importance or "just" about personal decisions. In the simplest form, such a justification might be a collection of basic facts and opinions.

More complex justifications are often grounded in argumentation, though; for instance, a complex relational aggregation of assertions and evidence pro or con either side, where different assertions or evidential statements support or refute each other. Furthermore, while web resources such as blogs, community question answering sites, news articles, or social platforms contain an immense variety of opinions and argumentative texts, a notable proportion of these may be of biased, faked, or populist nature. This has motivated argument retrieval research to focus not only on the relevance of arguments, but also on the aspect of their quality.
While conventional web search engines support the retrieval of factual information fairly well, they hardly address the deeper analysis and processing of argumentative texts, in terms of mining argument units from these texts, assessing the quality of the arguments, or classifying their stance. To address this, the argument search engine args.me [2] was developed to retrieve arguments relevant to a given controversial topic and to account for the pro or con stance of individual arguments in the result presentation. So far, however, it is limited to a document collection crawled from a few online debate portals, and largely disregards quality aspects. Other argument retrieval systems such as ArgumenText [3] and TARGER [4] take advantage of the large web document collection Common Crawl, but their ability to reliably retrieve arguments to support sides in a decision process is limited. The comparative argumentation machine CAM [5], a system for argument retrieval in comparative search, tries to support decision making in comparison scenarios based on billions of individual sentences from the Common Crawl. Still, it lacks a proper ranking of diverse longer argumentative texts. To foster research on argument retrieval and to establish an exchange of ideas and datasets among researchers, we organized the second Touché lab on argument retrieval at CLEF 2021.1 Touché is a collaborative platform2 to develop and share retrieval approaches that aim to support decisions at a societal level (e.g., “Should hate speech be penalized more, and why?”) and at a personal level (e.g., “Should I major in philosophy or psychology, and why?”), respectively. The second year of the Touché lab featured two tasks: 1. Argument retrieval for controversial questions from a focused collection of debates to support opinion formation on topics of social importance. 2. Argument retrieval for comparative questions from a generic web crawl to support in- formed decision making. Approaches to these two tasks that take argumentative quality into account besides topical relevance will help search engines to deliver more accurate argumentative results. Addition- ally, they will also be an important part of open-domain conversational agents that “discuss” controversial societal topics with humans—as showcased by IBM’s Project Debater [6, 7, 8].3 The teams that participated in the second year of Touché were able to use the topics and relevance judgments from the first year to develop their approaches. Many trained and optimized learning-based rankers as part of their retrieval pipelines and employed a large variety of pre- processing methods (e.g., stemming, duplicate removal, query expansion), argument quality 1 The name of the lab is inspired by the usage of the term ‘touché’ as an exclamation “used to admit that someone has made a good point against you in an argument or discussion.” [https://dictionary.cambridge.org/ dictionary/english/touche] 2 https://touche.webis.de/ 3 https://www.research.ibm.com/artificial-intelligence/project-debater/ features, or comparative features (e.g., credibility, part-of-speech tags). Overall, different to the first Touché lab, the majority of the submitted approaches improved over the argumentation- agnostic DirichletLM and BM25F-based baselines. In this paper, we review the participants’ approaches in depth and cover all runs in the evaluation results. 2. 
Previous Work Queries in argument retrieval often are phrases that describe a controversial topic, questions that ask to compare two options, or even complete arguments themselves [9]. In the Touché lab, we address the first two types in two different shared tasks. Here, we briefly summarize the related work on argument retrieval and on retrieval in comparative scenarios. 2.1. Argument Retrieval Argument retrieval aims for delivering arguments to support users in making a decision or to help persuading an audience of a specific point of view. An argument is usually modeled as a conclusion with supporting or attacking premises [2]. While a conclusion is a statement that can be accepted or rejected, a premise is a more grounded statement (e.g., a statistical evidence). The development of an argument search engine is faced with challenges that range from mining arguments from unstructured text to assessing their relevance and quality [2]. Argument retrieval follows several paradigms that start from different sources and perform argument mining and retrieval tasks in different orders [10]. Wachsmuth et al. [2], for instance, extract arguments offline using heuristics that are tailored to online debate portals. Their argument search engine args.me uses BM25F to rank the indexed arguments while giving conclusions more weight than premises. Also Levy et al. [11] use distant supervision to mine arguments offline for a set of topics from Wikipedia before ranking them. Following a different paradigm, Stab et al. [3] retrieve documents from the Common Crawl4 in an online fashion (no prior offline argument mining) and use a topic-dependent neural network to extract arguments from the retrieved documents at query time. With the two Touché tasks, we address the paradigms of Wachsmuth et al. [2] (Task 1) and Stab et al. [3] (Task 2), respectively. Argument retrieval should rank arguments according to their topical relevance but also to their quality. What makes a good argument has been studied since the time of Aristotle [12]. Recently, Wachsmuth et al. [13] categorized the different aspects of argument quality into a taxonomy that covers three dimensions: logic, rhetoric, and dialectic. Logic concerns the local structure of an argument, i.e, the conclusion and the premises and their relations. Rhetoric covers the effectiveness of the argument in persuading an audience with its conclusion. Dialectic addresses the relations of an argument to other arguments on the topic. For example, an argument that has many attacking premises might be rather vulnerable in a debate. The relevance of an argument to a query’s topic is categorized by Wachsmuth et al. [13] under dialectic quality. Researchers assess argument relevance by measuring an argument’s similarity to a query’s topic or incorporating its support/attack relations to other arguments. Potthast et al. [14] evalu- ate four standard retrieval models at ranking arguments with regard to the quality dimensions 4 http://commoncrawl.org of relevance, logic, rhetoric, and dialectic. One of the main findings is that DirichletLM is better at ranking arguments than BM25, DPH, and TF-IDF. Gienapp et al. [15] extend this work by proposing a pairwise strategy that reduces the costs of crowdsourcing argument retrieval annotations in a pairwise fashion by 93% (i.e., annotating only a small subset of argument pairs). Wachsmuth et al. [16] create a graph of arguments by connecting two arguments when one uses the other’s conclusion as a premise. 
Later on, they exploit this structure to rank the arguments in the graph using PageRank scores [17]. This method is shown to outperform several baselines that only consider the content of the argument and its local structure (conclusion and premises). Dumani et al. [18] introduce a probabilistic framework that operates on semantically similar claims and premises. The framework utilizes support/attack relations between clusters of premises and claims and between clusters of claims and a query. It is found to outperform BM25 in ranking arguments. Later, Dumani and Schenkel [19] also proposed an extension of the framework to include the quality of a premise as a probability by using the fraction of premises that are worse with regard to the three quality dimensions of cogency, reasonableness, and effectiveness. Using a pairwise quality estimator trained on the Dagstuhl-15512 ArgQuality Corpus [20], their probabilistic framework with the argument quality component outperformed the one without it on the 50 Task 1 topics of Touché 2020. 2.2. Retrieval for Comparisons Comparative information needs in web search have first been addressed by basic interfaces where two to-be-compared products are entered separately in a left and a right search box [21, 22]. Comparative sentences are then identified and mined from product reviews in favor or against one or the other to-be-compared option using opinion mining approaches [23, 24, 25]. Recently, the identification of the comparison preference (the “winning” option) in comparative sentences has been tackled in a more open domain (not just product reviews) by applying feature-based and neural classifiers [26, 27]. Such preference classification forms the basis of the comparative argumentation machine CAM [5] that takes two comparison objects and some comparison aspect(s) as input, retrieves comparative sentences in favor of one or the other object using BM25, and then classifies the sentences’ preferences for a final merged result table presentation. A proper argument ranking, however, is still missing in CAM. Chekalina et al. [28] later extended the system to accept comparative questions as input and to return a natural language answer to the user. A comparative question is parsed by identifying the comparison objects, aspect(s), and predicate. The system’s answer is either generated directly based on Transformers [29] or by retrieval from an index of comparative sentences. 3. Lab Overview and Statistics The second edition of the Touché lab received 36 registrations (compared to 28 registrations in the first year), with a majority coming from Germany and Italy, but also from the Americas, Europe, Africa, and Asia (16 from Germany, 10 from Italy, 2 from the United States and Mexico, and 1 each from Canada, India, the Netherlands, Nigeria, the Russian Federation, and Tunisia). Aligned with the lab’s fencing-related title, the participants were asked to select a real or fictional swordsman character (e.g., Zorro) as their team name upon registration. We received result submissions from 27 of the 36 registered teams (up from 17 active teams in the first year). As in the previous edition of Touché, we paid attention to foster the reproducibil- ity of the developed approaches by using the TIRA platform [30] that allows easy software submission and automatic evaluation. Upon registration, each team received an invitation to TIRA to deploy actual software implementations of their approaches. 
TIRA is an integrated cloud-based evaluation-as-a-service research architecture on which participants can install their software within a dedicated virtual machine. By default, the virtual machines operate the server version of Ubuntu 20.04 with one CPU (Intel Xeon E5-2620), 4 GB of RAM, and 16 GB HDD, but we adjusted the resources to the participants' requirements when needed (e.g., one team asked for 30 GB of RAM, 3 CPUs, and 30 GB of HDD). The participants had full administrative access to their virtual machines. Still, we pre-installed the latest versions of reasonable standard software (e.g., Docker and Python) to simplify the deployment of the approaches. Using TIRA, the teams could create result submissions via a click in the web UI that then initiated the following pipeline: the respective virtual machine is shut down, disconnected from the internet, and powered on again in a sandbox mode, mounting the test datasets for the respective Touché tasks, and running a team's deployed approach. The interruption of the internet connection ensures that the participants' software works without external web services that may disappear or become incompatible—possible causes of reproducibility issues—but it also means that downloading additional external code or models during the execution was not possible. We offered our support when this connection interruption caused problems during the deployment, for instance, with spaCy that tries to download models if they are not already available on the machine, or with PyTerrier that, in its default configuration, checks for online updates. To simplify participation of teams that do not want to develop a fully-fledged retrieval pipeline on their end, we enabled two exceptions from the interruption of the internet connection for all participants: the APIs of args.me and ChatNoir were available even in the sandbox mode to allow accessing a baseline system for each of the tasks. The virtual machines that the participants used for their submissions will be archived such that the respective systems can be re-evaluated or applied to new datasets as long as the APIs of ChatNoir and args.me remain available—which are both maintained by us.

When a software submission in TIRA really was not possible for some reason, the participants could also simply submit plain run files with their result rankings—an option chosen by 5 of the 27 participating teams. Per task, we allowed each team to submit up to 5 runs whose output must follow the standard TREC-style format.5 We checked the validity of all submitted run files and of the run files produced via TIRA, asking participants to resubmit their files or to rerun their software in case of validity issues—again, also offering our support in case of problems. All 27 active teams managed to submit at least one valid run. The total of 88 valid runs more than doubles the 41 valid runs from the first year.

5 The expected format was described at the lab's web page [https://touche.webis.de].

4. Task 1: Argument Retrieval for Controversial Questions

The goal of the Touché 2021 lab's first task was to advance technologies that support individuals in forming opinions on socially important controversial topics such as: "Should hate speech be penalized more?". For such topics, the task was to retrieve relevant and high-quality argumentative texts from the args.me corpus [10], a focused crawl of online debate portals. In this scenario, relevant arguments should help users to form an opinion on the topic and to find arguments that are potentially useful in debates or discussions.

The results of last year's Task 1 participants indicated that improving upon the "classic" argument-agnostic DirichletLM retrieval model is challenging, but, at the same time, the results of this baseline still left some room for potential improvements. Also, the detection of the degree of argumentativeness and the assessment of the quality of an argument were not "solved" in the first year, but identified as potentially interesting contributions of submissions to the task's second edition.

4.1. Task Definition

Given a controversial topic formulated as a question, approaches to Task 1 needed to retrieve relevant and high-quality arguments from the args.me corpus, which covers a wide range of timely controversial topics. To enable approaches that leverage training and fine-tuning, the topics and relevance judgments from the 2020 edition of Task 1 were provided.

4.2. Data Description

Topics. We formulated 50 new search questions on controversial topics. Table 1 shows an example consisting of a title (i.e., a question on a controversial topic), a description that summarizes the particular information need and search scenario, and a narrative describing what makes a retrieved result relevant (meant as a guideline for human assessors). We carefully selected the topics by clustering the debate titles in the args.me corpus, formulating questions for a balanced mix of frequent and niche topics—manually ensuring that at least some relevant arguments are contained in the args.me corpus for each topic.

Table 1
Example topic for Task 1: Argument Retrieval for Controversial Questions.

Number       89
Title        Should hate speech be penalized more?
Description  Given the increasing amount of online hate speech, a user questions the necessity and legitimacy of taking legislative action to punish or inhibit hate speech.
Narrative    Highly relevant arguments include those that take a stance in favor of or opposed to stronger legislation and penalization of hate speech and that offer valid reasons for either stance. Relevant arguments talk about the prevalence and impact of hate speech, but may not mention legal aspects. Irrelevant arguments are the ones that are concerned with offensive language that is not directed towards a group or individuals on the basis of their membership in the group.

Document Collection. The document collection for Task 1 was the args.me corpus [10]; freely available for download6 and also accessible via the args.me API.7 The corpus contains about 400,000 structured arguments crawled from several debate portals (debatewise.org, idebate.org, debatepedia.org, and debate.org), each with a conclusion (claim) and one or more supporting or attacking premises (reasons).
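Since DirichletLM serves as the argumentation-agnostic baseline for this task and as the starting point of many submissions, the following minimal sketch illustrates Dirichlet-smoothed query likelihood scoring over such argument texts. It uses made-up toy data and only illustrates the scoring formula; it is not the lab's actual baseline implementation.

```python
import math
from collections import Counter

def dirichlet_lm_score(query_terms, doc_terms, collection_tf, collection_len, mu=2000):
    """Dirichlet-smoothed query likelihood: sum_t log((tf(t,d) + mu*P(t|C)) / (|d| + mu))."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        p_collection = collection_tf.get(term, 0) / collection_len
        if p_collection == 0.0:
            continue  # a term unseen in the whole collection contributes no evidence here
        score += math.log((tf[term] + mu * p_collection) / (len(doc_terms) + mu))
    return score

# Toy collection of two (already tokenized) arguments for one controversial question.
docs = {
    "arg-1": "hate speech should be penalized because it harms vulnerable groups".split(),
    "arg-2": "free speech must be protected even when it is offensive".split(),
}
collection_tf = Counter(term for terms in docs.values() for term in terms)
collection_len = sum(collection_tf.values())

query = "should hate speech be penalized more".split()
ranking = sorted(docs,
                 key=lambda d: dirichlet_lm_score(query, docs[d], collection_tf, collection_len),
                 reverse=True)
print(ranking)  # arguments ordered by descending Dirichlet-smoothed score
```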
4.3. Judgment Process

The teams' result rankings should be formatted in the "standard" TREC format where document IDs are sorted by descending relevance score for each search topic. Prior to creating the assessment pools, we ran a near-duplicate detection for all submitted runs using the CopyCat framework [31], since near-duplicates might impact evaluation results [32, 33]. The framework found only 1.1% of the arguments in the top-5 results to be near-duplicates (mostly due to debate portal users reusing their arguments in multiple debate threads).
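For reference, a TREC-style run file as expected here contains one line per retrieved document with the topic number, the literal "Q0", a document ID, the rank, the retrieval score, and a run tag. The document IDs and tag below are placeholders rather than actual args.me argument IDs.

```
89 Q0 arg-id-00123 1 17.89 mySwordsmanBaseline
89 Q0 arg-id-04711 2 16.43 mySwordsmanBaseline
89 Q0 arg-id-00987 3 15.02 mySwordsmanBaseline
```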
We created duplicate-free versions of each result list by removing the documents for which a higher-ranked document is a near-duplicate; in such cases, the next ranked non-near-duplicate then just moved up the ranked list. The top-5 results of the original and the deduplicated runs then formed the judgment pool—created with TrecTools [34]—resulting in 3,711 unique documents that were manually assessed with respect to their relevance and their argumentative quality. For the assessment, we used the Doccano tool [35] and followed previously suggested anno- tation guidelines [15, 14]. Our eight graduate and undergraduate student volunteers (all with a computer science background) assessed each argument’s relevance to the given topic with four labels (0: not relevant, 1: relevant, 2: highly relevant, or -2: spam) and the argument’s rhetorical quality [20] with three labels (0: low quality, 1: sufficient quality, and 2: high qual- ity). To calibrate the annotators’ interpretations of the guidelines (i.e., the topics including the narratives and instructions on argument quality), we conducted an initial kappa test in which each annotator had to label the same 15 arguments from 3 topics (5 arguments from each topic). The observed Fleiss’ 𝜅 values of 0.50 for argument relevance (moderate agreement) and of 0.39 for argument quality (fair agreement) are similar to previous studies [15, 36, 20]. However, we still had a follow-up discussion with all the annotators to clarify potential misinterpretations. Afterwards, each annotator independently judged the results for disjoint subsets of the topics (i.e., each topic was judged by one annotator only). 4.4. Submitted Approaches and Results Twenty-one participating teams submitted at least one valid run to Task 1. The submissions partly continued the trend of Touché 2020 [37] by deploying “classical” retrieval models, how- ever, with an increased focus on machine learning models (especially for query expansion and for assessing argument quality). Overall, we observed two kinds of contributions: (1) Repro- ducing and fine-tuning approaches from the previous year by increasing their robustness, and (2) developing new, mostly neural approaches for argument retrieval by fine-tuning pre-trained models for the domain-specific search task at hand. 6 https://webis.de/data.html#args-me-corpus 7 https://www.args.me/api-en.html Like in the first year, combining “classical” retrieval models with various query expansion methods and domain-specific re-ranking features remained a frequent choice of approaches to Task 1. Not really surprising—given last year’s baseline results—DirichletLM was employed most often as the initial retrieval model, followed by BM25. For query expansion, most participating teams continued to leverage WordNet [38]. However, transformer-based approaches received increased attention, such as query hallucination, which was successfully used by Akiki and Potthast [39] in the previous Touché lab. Similarly, utilizing deep semantic phrase embeddings to calculate the semantic similarity between a query and possible result documents gained widespread adoption. Moreover, many approaches tried to use some form of argument quality estimation as one of their features for ranking or re-ranking. This year’s approaches benefited from the judgments released for Touché in 2020. Many teams used them for general parameter optimization but also to evaluate intermediate results of their approaches and to fine-tune or select the best configurations. 
For instance, comparing different kinds of pre-processing methods based on the available judgments from last year received much attention (e.g., stopword lists, stemming algorithms, or duplicate removal).

The results of the runs with the best nDCG@5 scores per participating team are reported in Table 2 (cf. Appendix A for evaluation results of all submitted runs). Below, we review the participants' approaches submitted to Task 1, ordered alphabetically by team name.8

8 Nine teams participated in Task 1 with valid runs, but did not submit a notebook describing their approach. Their methodology is summarized in short here, after consulting with the respective team members. Blade and Palpatine did not provide further information.

Asterix [40] preprocesses the args.me corpus by removing duplicate documents and filtering out documents that are too short. The resulting dataset is indexed using BM25. Then a linear regression model is trained on the Webis-ArgQuality-20 argument quality dataset [15], predicting a given argument's overall quality. At retrieval time, the topic query is expanded using WordNet-based query expansion, 1,000 documents are retrieved using the BM25 index, and then re-ranked using a weighted combination of the normalized predicted quality score and the normalized BM25 score. They optimize the weighting against nDCG@5 using the relevance judgments from Touché 2020. A total of five runs were submitted.

Athos uses a DirichletLM retrieval model with a 𝜇 value of 2,000 and indexes the fields of an argument (conclusion and premise) separately. Both fields get preprocessed by lower-casing and removing stop words, URLs, and emails. The ranking scores for both fields are then weighted as follows: 0.1 for the conclusion and 0.9 for the premise. A single run was submitted.

Blade uses a DirichletLM retrieval model in one run, and two variations of a BM25-based retrieval in two further runs. Unfortunately, no further details have been provided.

Batman [41] sets out to quantify the contributions of various steps of a retrieval pipeline, using argument retrieval as their proving ground. A finite search space is defined and effectiveness is systematically measured as more modules are added to the retrieval pipeline. Using relevance judgments from Touché 2020, the best combination of similarity function and tokenizer is determined, and then, gradually, different modules are added, evaluated, and frozen, such as different stop word lists, different stemmers, and different filtering approaches. This amounts to a comprehensive grid search in hyperparameter space that allowed for choosing better-working components over worse ones for the retrieval pipeline, and provided for a good comparative overview of them. A total of three runs were submitted.
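Several of the submissions summarized in this section, for example Asterix above and the Yeagerists described later, re-rank results by a weighted combination of a normalized retrieval score and a normalized predicted argument-quality score, tuning the weight on the Touché 2020 judgments. The following minimal sketch shows this re-ranking pattern; the scores, weights, and document IDs are illustrative and not taken from any team's code.

```python
def min_max_normalize(scores):
    """Map a {doc_id: score} dict onto the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def combine(retrieval_scores, quality_scores, alpha):
    """Interpolate a normalized retrieval score with a normalized predicted quality score."""
    rel = min_max_normalize(retrieval_scores)
    qual = min_max_normalize(quality_scores)
    return {doc: (1 - alpha) * rel[doc] + alpha * qual.get(doc, 0.0) for doc in rel}

# Hypothetical BM25 scores and quality-regression outputs for one topic.
bm25 = {"arg-1": 12.3, "arg-2": 10.8, "arg-3": 9.4}
quality = {"arg-1": 0.2, "arg-2": 0.9, "arg-3": 0.6}

# The submissions tune alpha against nDCG@5 on the 2020 judgments;
# here we only print the resulting rankings for a few candidate weights.
for alpha in (0.0, 0.3, 0.5, 0.7):
    combined = combine(bm25, quality, alpha)
    ranked = sorted(combined, key=combined.get, reverse=True)
    print(alpha, ranked)
```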
Table 2
Results for Task 1: Argument Retrieval for Controversial Questions. The left part (a) shows the evaluation results of a team's best run according to the results' relevance, while the right part (b) shows the best runs according to the results' quality. An asterisk (⋆) indicates that the runs with the best relevance and the best quality differ for a team. The baseline DirichletLM ranking (team Swordsman) is marked.

(a) Best relevance score per team (nDCG@5)

Team                      Relevance  Quality
Elrond⋆                   0.720      0.809
Pippin Took⋆              0.705      0.798
Robin Hood⋆               0.691      0.756
Asterix⋆                  0.681      0.802
Dread Pirate Roberts⋆     0.678      0.804
Skeletor⋆                 0.667      0.815
Luke Skywalker            0.662      0.808
Shanks⋆                   0.658      0.790
Heimdall⋆                 0.648      0.833
Athos                     0.637      0.802
Goemon Ishikawa           0.635      0.812
Jean Pierre Polnareff     0.633      0.802
Swordsman (baseline)      0.626      0.796
Yeagerists                0.625      0.810
Hua Mulan⋆                0.620      0.789
Macbeth⋆                  0.611      0.783
Blade⋆                    0.601      0.751
Deadpool                  0.557      0.679
Batman                    0.528      0.695
Little Foot               0.521      0.718
Gandalf                   0.486      0.603
Palpatine                 0.401      0.562

(b) Best quality score per team (nDCG@5)

Team                      Quality    Relevance
Heimdall⋆                 0.841      0.639
Skeletor⋆                 0.827      0.666
Asterix⋆                  0.818      0.663
Elrond⋆                   0.817      0.674
Pippin Took⋆              0.814      0.683
Goemon Ishikawa           0.812      0.635
Hua Mulan⋆                0.811      0.620
Dread Pirate Roberts⋆     0.810      0.647
Yeagerists                0.810      0.625
Robin Hood⋆               0.809      0.641
Luke Skywalker            0.808      0.662
Macbeth⋆                  0.803      0.608
Athos                     0.802      0.637
Jean Pierre Polnareff     0.802      0.633
Swordsman (baseline)      0.796      0.626
Shanks⋆                   0.795      0.639
Blade⋆                    0.763      0.588
Little Foot               0.718      0.521
Batman                    0.695      0.528
Deadpool                  0.679      0.557
Gandalf                   0.603      0.486
Palpatine                 0.562      0.401

Deadpool applies a query expansion technique with a DirichletLM model (𝜇 = 4000). Both the conclusion and the premise of an argument are indexed, with 0.1 and 0.9 weights, respectively. The query expansion technique relies on the top-5 arguments to derive terms that are associated with the query terms. To quantify the co-occurrence of a term in an argument with the query terms, its conditional probability given the query terms is calculated and smoothed by the term's inverse document frequency. The conditional probability of a term given a query term is calculated as the number of arguments that contain both terms divided by the number of arguments that contain the query term. A single run was submitted.

Dread Pirate Roberts [42] uses four classes of approaches to retrieve relevant arguments from the args.me corpus for a query on a controversial topic. Roberts contrasts two "traditional" approaches with two novel approaches. The traditional approaches involve one run that uses a Dirichlet-smoothed language model with low-quality arguments removed by argument clustering with the Universal Sentence Encoder model [43], and two feature-based learning-to-rank approaches with LambdaMART [44]. The learning-to-rank models are trained on the relevance labels of Task 1 of Touché 2020 and differ in the used features. With 31 features belonging to 5 different feature classes as a starting point, Roberts runs a greedy feature selection identifying subsets of 4 and 9 features with the best nDCG scores in a five-fold cross-validation setup. Afterwards, both feature sets are used on all relevance labels of Task 1 of Touché 2020 to train dedicated LambdaMART models that re-rank the top-100 results of the DirichletLM retrieval, producing two LambdaMART runs. Roberts further submits one run that re-ranks the top-100 results of the DirichletLM retrieval with a question-answering model. The idea behind this run is to phrase the task of retrieving relevant arguments for a controversial query as deciding whether an argument "answers" the controversial query. Therefore, the question-answering retrieval model coming with the Universal Sentence Encoder scores each of the top-100 arguments for a query by whether the argument "answers" the query or not, sorting the arguments by descending question-answering score. The fifth run uses transformer-based query expansion where the query is expanded with keywords generated with RoBERTa [45]: the controversial query is embedded into a pattern that lets RoBERTa predict tokens, and the query is expanded with the top-10 predicted tokens, weighted by their RoBERTa scores, before being submitted to the DirichletLM retrieval model. A total of five runs were submitted.
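The masked-language-model query expansion described for Dread Pirate Roberts' fifth run can be sketched as follows, assuming the Hugging Face transformers fill-mask pipeline and a generic roberta-base checkpoint; the prompt pattern and model choice are illustrative assumptions, not the team's actual setup.

```python
from transformers import pipeline

# Fill-mask pipeline with a generic RoBERTa checkpoint (illustrative choice).
fill_mask = pipeline("fill-mask", model="roberta-base")

def expand_query(query, top_k=10):
    """Let RoBERTa predict tokens for a masked prompt built around the query and
    return (token, score) pairs to be used as weighted expansion terms."""
    prompt = f"{query} This question is about <mask>."  # hypothetical prompt pattern
    predictions = fill_mask(prompt, top_k=top_k)
    return [(p["token_str"].strip(), p["score"]) for p in predictions]

expansion_terms = expand_query("Should hate speech be penalized more?")
print(expansion_terms)
# The weighted terms would then be appended to the original query and
# submitted to the DirichletLM retrieval model.
```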
Elrond focuses on implementing a document analysis pipeline to be used together with a DirichletLM-based retrieval. They rely on the Krovetz stemming algorithm and remove stop words using a custom stop list. They also compute part-of-speech tags and remove tokens from documents by filtering out certain tags. Documents are further enriched using WordNet-based synonyms. A total of four runs were submitted.

Gandalf indexes for each argument only the conclusion and uses BM25 as a retrieval model in a single-run submission.

Goemon Ishikawa [46] explores different configurations of a standard Lucene-based retrieval pipeline, varying the similarity function (BM25, DirichletLM), tokenizers (Lucene, OpenNLP), stop word lists (Lucene, Atire, Terrier, Smart), and lemmatizers (OpenNLP). Additionally, they test query expansion with synonyms from WordNet. Thirteen such configurations were evaluated on topics from Touché 2020 with respect to average precision, precision@10, nDCG, and nDCG@5. In an analysis of variance, they observe overall high variances for all evaluation measures and configurations, and that DirichletLM-based configurations perform significantly better; however, the effect of different tokenizers, stop word lists, or lemmatizers could not be assessed conclusively. A manual analysis by the authors on two topics suggests that expanding the query with synonyms can cause query drift. Using five DirichletLM models, two of which expand the query and none of which apply lemmatization, a total of five runs were submitted.

Heimdall [47] aims at including both topical relevance and argument quality while ranking arguments. As a basic retrieval model, DirichletLM is used; it is considered to capture mere textual relevance. To assess the topical relevance of an argument, arguments are embedded using the Universal Sentence Encoder and then clustered using k-means with 𝑘 = 300. Arguments are then represented using their cluster centroids, and the topical relevance of an argument is calculated using the cosine similarity of the query to the centroid. Argument quality is assessed using a support vector regression model that is trained on the Webis-ArgQuality-20 corpus. The regression model achieves a mean squared error of 0.19. Before assessing the quality of arguments, an argumentativeness classifier is used to filter input instances that are not arguments. The support vector machine classifier is also trained on the same dataset and achieves an F1-score of 0.88. A total of five runs were submitted.

Hua Mulan [48] proposes to expand documents from the args.me corpus prior to indexing, evaluating how different expansion methods affect the argument retrieval for controversial topics. Three expansion approaches are presented: the first uses a transformer-based query prediction to generate queries based on the premises and conclusions as input, which are then added to the documents. The second is also transformer-based and generates ("hallucinates") arguments using GPT-2 based on the conclusions.
The third approach uses TF-IDF to determine the top-10 keywords and expands the premises using synonyms from the WordNet database. For evaluation, all corpora were indexed and retrieved using Elasticsearch and the DirichletLM similarity. The altered args.me corpus with expansions is made available as dataset. A total of three runs were submitted. Jean-Pierre Polnareff [49] combines differently weighted versions of the BM25 and Dirich- letLM retrieval model with a WordNet-based query expansion, and a re-ranking component that incorporates sentiment analysis to explore whether boosting arguments with high sentiment scores or boosting neutral arguments leads to better results. The authors provide an ablative evaluation study for each of these three components, motivating their parameter choice at each step. Furthermore, different text pre-processing steps were reviewed in-depth, evaluating the effect of the choice of stop word list and stemming algorithm on the final result. A single run was submitted. Little Foot applies a query expansion technique over an Okapi BM25 model. The team indexes three fields for each argument: conclusion, premise, and context. Preprocessing the three fields includes lemmatization and removing stop words. The query expansion technique expands nouns, adjectives, and adverbs in the query with synonyms from WordNet. When multiple meanings exist for a word (known as “synset” in WordNet jargon), the approach uses the Lesk algorithm [50] to disambiguate the meaning of the word based on the context. A single run was submitted. Luke Skywalker indexes for each argument its premise, conclusion, and context. As a retrieval model they implemented their own tf ·idf model in a single-run submission. Macbeth [51] describes an approach that utilizes fine-tuned SBERT sentence embeddings [52] in conjunction with different retrieval strategies. First, further pre-training of the RoBERTa model on the args.me corpus with annotated relevance labels is carried out. They then obtain sentence embeddings of all documents in the args.me corpus with SBERT based on the pre- trained model. Weakly supervised data-augmentation is used to fine-tune the bi-encoder further, based on labels inferred using a cross-encoder architecture. Three retrieval strategies are then applied: (1) approximate nearest-neighbor vector retrieval on the inferred document embeddings, (2) BM25, and (3) a mixture of both. An initial retrieved pool of candidate documents is re- ranked by direct query/document comparison using a cross-encoder architecture. The authors experiment with different pipeline configurations. A total of five runs were submitted. Palpatine, befittingly, submitted one of the worst-performing of all runs, without providing any explanation whatsoever. Pippin Took [53] first preprocesses documents with the Krovetz Stemmer [54], and remove stop words using a custom stop word list curated from various libraries. After parameter-tuning Lucene’s implementation of DirichletLM using the Touché 2020 relevance labels, they then experiment with two different retrieval pipelines: (1) query expansion with WordNet, and (2) phrase search with term trigrams, which follows the idea that arguments containing parts of the query as phrases will be part of an effective argumentative ranking. Therefore, the arguments are indexed as term trigrams, and each query is split into term trigrams to retrieve arguments with DirichletLM. 
However, preliminary experiments suggested that argument retrieval with term trigrams substantially decreases the nDCG@5. Hence, Took omits phrase search and submits three runs with DirichletLM only, and two runs with DirichletLM and query expansion, varying the parameter 𝜇 of DirichletLM, for a total of five runs. Robin Hood relies on the RM3 implementation from the Pyserini toolkit [55] to perform query expansion. For retrieval, they embed both the premise and the conclusion of each argument into two separate vector spaces using the Universal Sentence Encoder, ranking arguments based on the cosine similarity between embedded query and document. The two embeddings are incorporated with different weights. They further take document length into account, deducting up to 15% of an arguments score if its length lies outside of one standard deviation of the mean across the whole corpus. They submit one baseline run using the DirichletLM retrieval model, one with RM3 query expansion applied on top of that, one using only cosine similarity on phrase embeddings, and one using RM3 in conjunction with phrase embeddings for retrieval, for a total of four runs. Shanks [56] indexes discussion titles in addition to the premises and conclusions in the args.me corpus. They construct a custom stop word list based on the Smart and Lucene lists, as well as frequent terms within the document collection. They then use a Boolean model with the individual terms of the query to apply boosts to the indexed documents. Each matched term between query and discussion titles, conclusions, and premises in the corpus, as well as all identified WordNet synonyms of query terms receive a boosting factor. Both BM25 and DirichletLM are then used to retrieve relevant documents, with boosting applied. Additionally, a proximity search for all term pairs within the query can be performed and boosted individually. A total of five runs were submitted. Skeletor [57] submits five runs using three different approaches: (1) BM25 retrieval, (2) ranking arguments based on their semantic similarity to the query, and (3) using pseudo relevance feedback in combination with the semantic similarity of passages. Unanimously, the arguments’ premise is used for ranking. The BM25 approach uses Pyserini with the BM25 parameters 𝑘1 and 𝑏 fine-tuned with grid search on the relevance judgments from Touché 2020. The two semantic similarity runs use the model msmarco-distilbert-base-v3 provided by Sentence Transformers [52], which was fine-tuned for question-answering on MS MARCO [58]. Therefore, arguments are split by sentence into passages of approximately 200 words, using the maximum cosine similarity of all passages in the argument to the encoded query as retrieval score. The submitted runs differ as follows: Run 1 ranks documents solely by their semantic similarity to the query using approximate nearest neighbor search; Runs 2 and 3 interpolate the semantic similarity score with the tuned BM25 scores; Runs 4 and 5 use the top-3 arguments retrieved by the interpolation of BM25 with the semantic similarity score as pseudo relevance feedback: for each passage from the relevance feedback, the 50 most similar passages are identified with an approximate nearest neighbor search on all encoded passages of the corpus. The probabilities that a passage is highly similar to a passage in the pseudo relevance feedback are determined with manifold approximation and summed as the argument’s score. 
In Run 4, all arguments in the corpus are ranked with this score, and in Run 5 only the top-10 results of the interpolation of BM25 with the semantic similarity are re-ranked.

The baseline run of Swordsman encompasses two separate approaches: the Elasticsearch implementation of query likelihood with Dirichlet-smoothed language models (DirichletLM [59]), as well as the args.me API.

The Yeagerists [60] describe an approach that integrates two components: query expansion and argument quality regression. Query expansion is performed using a pretrained BERT model which is prompted to substitute certain masked words (adjectives, nouns, and past participles) in the topics. Argument quality regression is performed by training a BERT model as a regressor on Webis-ArgQuality-20. The regression model is trained on an 8:1:1 split using mean squared error (MSE) as a loss function, and achieves an MSE of 0.728 on the test split. At retrieval time, for each topic, ten queries are generated using the lexical substitution algorithm and then forwarded to a DirichletLM retrieval model to produce a relevance score. The top-100 arguments are then passed to the regression model to predict their quality score. The relevance score and quality score are normalized and averaged with a weighting variable 𝛼 that controls the contribution of the quality score to the averaged score. The team tests different 𝛼-values using the relevance labels from Touché 2020 to motivate parameter choices for their submitted runs. A total of five runs were submitted.

5. Task 2: Argument Retrieval for Comparative Questions

The goal of the Touché 2021 lab's second task was to support individuals making informed decisions in "everyday" or personal comparison situations—in its simplest form for questions such as "Is X or Y better for Z?". Decision making in such situations benefits from finding balanced justifications for choosing one or the other option, for instance, via an overview of relevant and high-quality pro/con arguments. Similar to Task 1, the results of last year's Task 2 participants indicated that improving upon an argument-agnostic BM25F baseline is challenging. Promising proposed approaches tried to re-rank based on features capturing "comparativeness" or "argumentativeness."

5.1. Task Definition

Given a comparative question, an approach to Task 2 needed to retrieve documents from the general web crawl ClueWeb12⁹ that help to come to an informed decision on the comparison. Ideally, the retrieved documents should be argumentative with convincing arguments for or against one or the other option. To identify arguments in web documents, the participants were not restricted to any system; they could use their own technology or any existing argument taggers such as MARGOT [61]. To lower the entry barriers for participants new to argument mining, we offered support for using the neural argument tagger TARGER [4], hosted on our own servers and accessible via an API.10

9 https://lemurproject.org/clueweb12/
10 https://demo.webis.de/targer-api/apidocs/

5.2. Data Description

Topics. For the second edition of Task 2, we manually selected 50 new comparative questions from the MS MARCO dataset [58] (questions from Bing's search logs) and the Quora dataset [62] (questions asked on the Quora question answering website). We ensured that the questions cover diverse topics, for example, asking about electronics, cuisine, house appliances, life choices, etc. Table 3 shows an example topic for Task 2 that consists of a title (i.e., a comparative question), a description of the possible search context and situation, and a narrative describing what makes a retrieved result relevant (meant as a guideline for human assessors). In the topic selection, we ensured that relevant documents for each topic were actually contained in the ClueWeb12 (i.e., avoiding questions on comparison options not known at the ClueWeb12 crawling time in 2012).

Table 3
Example topic for Task 2: Argument Retrieval for Comparative Questions.

Number       88
Title        Should I major in philosophy or psychology?
Description  A soon-to-be high-school graduate finds themself at a crossroads in their life. Based on their interests, majoring in philosophy or in psychology are the potential options, and the graduate is searching for information about the differences and similarities, as well as advantages and disadvantages of majoring in either of them (e.g., with respect to career opportunities or gained skills).
Narrative    Relevant documents will overview one of the two majors in terms of career prospects or developed new skills, or they will provide a list of reasons to major in one or the other. Highly relevant documents will compare the two majors side-by-side and help to decide which should be preferred in what context. Not relevant are study program and university advertisements or general descriptions of the disciplines that do not mention benefits, advantages, or pros/cons.

Document Collection. The document collection was formed by the ClueWeb12 dataset that contains 733 million English web pages (27.3 TB uncompressed), crawled by the Language Technologies Institute at Carnegie Mellon University between February and May 2012. For participants of Task 2 who could not index the ClueWeb12 at their site, we provided access to the indexed corpus through the BM25F-based search engine ChatNoir [63] via its API.11

11 https://www.chatnoir.eu/doc/
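A minimal sketch of retrieving an initial candidate set via the ChatNoir API, as all Task 2 participants did in one form or another; the endpoint, parameter names, and index identifier follow the public ChatNoir documentation (footnote 11) but should be read as assumptions here, and a personal API key is required.

```python
import requests

CHATNOIR_API = "https://www.chatnoir.eu/api/v1/_search"  # per the ChatNoir docs; assumed here
API_KEY = "<your-api-key>"  # issued on request by the ChatNoir team

def chatnoir_search(query, size=100):
    """Retrieve BM25F-ranked ClueWeb12 results for a (comparative) question."""
    payload = {
        "apikey": API_KEY,
        "query": query,
        "index": ["cw12"],  # ClueWeb12 index identifier (assumed)
        "size": size,
    }
    response = requests.post(CHATNOIR_API, json=payload, timeout=30)
    response.raise_for_status()
    return response.json().get("results", [])

for hit in chatnoir_search("Should I major in philosophy or psychology?", size=5):
    # Typical result fields include a document ID, title, snippet, and retrieval score.
    print(hit.get("trec_id"), hit.get("score"), hit.get("title"))
```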
5.3. Judgment Process

Using the CopyCat framework [31], we found that, on average, 11.6% of the documents in the top-5 results of a run were near-duplicates—a non-negligible redundancy that might have negatively impacted the reliability and validity of our evaluation since rankings containing multiple relevant duplicates tend to overestimate the actual retrieval effectiveness [32, 33]. Following the strategy used in Task 1, we pooled the top-5 documents from the original and the deduplicated runs, resulting in 2,076 unique documents that needed to be judged. Our eight volunteer annotators (same as for Task 1) labeled a document for its topical relevance (three labels; 0: not relevant, 1: relevant, and 2: highly relevant) and for whether rhetorically well-written arguments [20] were contained (three labels; 0: low quality or no arguments in the document, 1: sufficient quality, and 2: high quality). Similar to Task 1, our eight volunteer assessors went through an initial kappa test on 15 documents from 3 topics (5 documents per topic). As in the case of Task 1, the observed Fleiss' 𝜅 values of 0.46 for relevance (moderate agreement) and of 0.22 for quality (fair agreement) are similar to previous studies [15, 36, 20]. Again, however, we had a follow-up discussion with all the annotators to clarify some potential misinterpretations.
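The reported agreement values for both tasks are Fleiss' 𝜅 over such small calibration sets. A minimal sketch of how this statistic can be computed, assuming the statsmodels implementation and made-up labels for 15 documents judged by 8 annotators:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Made-up relevance labels (0, 1, 2) for 15 documents judged by 8 annotators;
# real calibration labels would of course not be random.
rng = np.random.default_rng(0)
ratings = rng.integers(low=0, high=3, size=(15, 8))  # rows: documents, columns: annotators

# Convert the subjects-by-raters matrix into per-category counts, then compute kappa.
counts, _categories = aggregate_raters(ratings)
print(round(fleiss_kappa(counts), 2))
```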
Afterwards, each annotator independently judged the results for disjoint subsets of the topics (i.e., each topic was judged by one annotator only). 5.4. Submitted Approaches and Results For Task 2, six teams submitted approaches that all used ChatNoir for an initial document retrieval, either by submitting the original topic titles as queries, or by applying query prepro- cessing (e.g., lemmatization and POS-tagging) and query expansion techniques (e.g., synonyms from WordNet [38], or generated with word2vec [64] or sense2vec embeddings [65]). On the retrieved ChatNoir results, most teams then applied a document “preprocessing” (e.g., removing HTML markup) before re-ranking with feature-based or neural classifiers trained on last year’s judgments with, for instance, argumentativeness, credibility, or comparativeness scores as features. The teams predicted document relevance labels by using a random forest classifier, XGBoost [66], LightGBM [67], or a fine-tuned BERT [29]. The results of the runs with the best nDCG@5 scores per participating team are reported in Table 4 (cf. Appendix A for the evaluation results of all submitted runs). Below, we give an overview of the approaches submitted to Task 2, ordered alphabetically by team name.12 Jack Sparrow [68] lemmatizes the question queries in a preprocessing step, creates expansion terms by detecting “comparison” terms in the questions (e.g., nouns or comparative adjec- tives/adverbs as identified by spaCy’s POS tagger13 ), and identifies synonyms of these terms from WordNet synsets [38], from word2vec [64], and sense2vec embeddings [65]. The top-100 ChatNoir results returned for the preprocessed and expanded questions are then re-ranked by a support vector regression trained on the Touché 2020 topics and judgments to predict relevance scores for the documents using combinations of the following normalized features: (1) argumentative score (sum of argumentativeness probabilities returned by TARGER for each token inside premises and claims), (2) (pseudo) trustworthiness score (0–10-valued PageRank scores obtained from Open PageRank)14 , (3) relevance labels predicted by a BERT-based classifier fine-tuned on the Touché 2020 topics and judgments, and (4) the ChatNoir relevance score. Different runs of Sparrow use different combinations of query preprocessing and expansion, and different feature combinations for the support vector regression; the most effective run 12 One team participated in Task 2 with a valid run, but did not submit a notebook describing their approach. Their methodology is summarized in short here, after consulting with the team members. 13 https://spacy.io/ 14 https://www.domcop.com/openpagerank/what-is-openpagerank Table 4 Results for Task 2: Argument Retrieval for Comparative Questions. The left part (a) shows the eval- uation results of a team’s best run according to the results’ relevance, while the right part (b) shows the best runs according to the results’ quality. An asterisk (⋆ ) indicates that the runs with the best relevance and the best quality differ for a team. The baseline ChatNoir ranking is shown in bold. 
(a) Best relevance score per team (nDCG@5)

Team                      Relevance  Quality
Katana⋆                   0.489      0.675
Thor                      0.478      0.680
Rayla⋆                    0.473      0.670
Jack Sparrow              0.467      0.664
Mercutio                  0.441      0.651
Puss in Boots (baseline)  0.422      0.636
Prince Caspian            0.244      0.548

(b) Best quality score per team (nDCG@5)

Team                      Quality    Relevance
Rayla⋆                    0.688      0.466
Katana⋆                   0.684      0.460
Thor                      0.680      0.478
Jack Sparrow              0.664      0.467
Mercutio                  0.651      0.441
Puss in Boots (baseline)  0.636      0.422
Prince Caspian            0.548      0.244

uses query lemmatization and expansion while the regression is trained on the BERT relevance predictions, combined with the ChatNoir relevance scores. A total of four runs were submitted.

Katana [69] re-ranks the top-100 ChatNoir results (original questions as queries) using different feature-based and neural classifiers/rankers to predict the final relevance labels: (1) an XGBoost [66] approach (overall relevance-wise most effective run), (2) a LightGBM [67] approach (team Katana's quality-wise best run), (3) Random Forests [70], and (4) a BERT-based ranker from OpenNIR [71]. The feature-based approaches are trained on the topics and judgments from Touché 2020, employing a range of relevance features (e.g., ChatNoir relevance score) and "comparativeness" features (e.g., number of identified comparison objects, aspects, or predicates [28]). The BERT-based ranker is trained on the ANTIQUE question-answering dataset [72] (34,000 text passages with relevance annotations for 2,600 open-domain non-factoid questions). A total of six runs were submitted (we evaluated all of them since the overall judgment load was feasible).

Mercutio [73] expands the original question queries with synonyms obtained from word2vec embeddings [64] (Mercutio's best run uses embeddings pre-trained on the Gigaword corpus15) or nouns found in GPT-2 [74] extensions when prompted with the question. The respective top-100 ChatNoir results are then re-ranked based on a linear combination of several scores (e.g., term-frequency counts, ratio of premises and claims in documents as identified by TARGER, etc.). The weights of the individual scores are optimized in a grid search on the Touché 2020 topics and judgments. A total of three runs were submitted.

15 https://catalog.ldc.upenn.edu/LDC2011T07

Prince Caspian re-ranks the top-40 ChatNoir results returned for the questions without stop words. The re-ranking uses the results' main content (extracted with the BoilerPy3 library;16 topic title terms in the extracted main content are masked with a "MASK" token) and a logistic regression classifier (features: tf·idf-weighted 1- to 4-grams; training on the Touché 2020 topics and judgments) that predicts the probability of a result being relevant (final ranking by descending probability). A single run was submitted.

16 https://pypi.org/project/boilerpy3/

The baseline run of Puss in Boots simply uses the results that ChatNoir [63] returns for the original question query. ChatNoir is an Elasticsearch-based search engine for the ClueWeb12 (and several other web corpora) that employs BM25F ranking (fields: document title, keywords, main content, and the full document) and SpamRank scores [75].

Rayla [76] uses two query processing/expansion techniques: (1) removing stop words and punctuation, and then lemmatizing the remaining tokens with spaCy, and (2) expanding comparative adjectives/adverbs (POS-tagged with spaCy) with a maximum of five synonyms and antonyms.
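A minimal sketch of the POS-based detection of comparison objects and comparative adjectives/adverbs that this and other Task 2 approaches use for query analysis and expansion, assuming spaCy's small English model; the exact tag sets and subsequent expansion steps differ from team to team.

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def comparison_terms(question):
    """Return candidate comparison objects (nouns) and comparative adjectives/adverbs
    (Penn Treebank tags JJR/RBR) found in a comparative question."""
    doc = nlp(question)
    objects = [token.text for token in doc if token.pos_ in {"NOUN", "PROPN"}]
    comparatives = [token.text for token in doc if token.tag_ in {"JJR", "RBR"}]
    return objects, comparatives

print(comparison_terms("Is a laptop or a tablet better for taking notes?"))
# e.g. (['laptop', 'tablet', 'notes'], ['better'])
```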
The final re-ranking is created by linearly combining different scores such as a ChatNoir’s relevance score, PageRank, and SpamRank (both also returned by ChatNoir), an argument support score (ratio of argumentative sentences (premises and claims) in documents found with a custom DistilBERT-based [77] classifier), and a similarity score (averaged cosine similarity between the original query and every argumentative sentence in the document repre- sented by Sentence-BERT embeddings [52]). The weights of the individual scores are optimized in a grid search on the Touché 2020 topics and judgments. A total of four runs were submitted. Thor [78] removes, as query preprocessing, any punctuation from the topic titles. They then locally create an Elasticsearch BM25F index of the top-110 ChatNoir results (fields: original and lemmatized document title, document body extracted using the BoilerPy3 library, and premises and claims as identified by TARGER in the body) with the BM25 parameters optimized by a grid search on the Touché 2020 judgments (𝑏 = 0.68 and 𝑘1 = 1.2). The local index is then queried with the lemmatized topic title expanded by WordNet synonyms [38]. A single run was submitted. 6. Summary and Outlook From the 36 teams that registered for the Touché 2021 lab, 27 actively participated by submitting at least one valid run to one of the two shared tasks: (1) argument retrieval for controversial questions, and (2) argument retrieval for comparative questions. Most of the participating teams used the judgments from the first lab’s edition to train feature-based or neural approaches that predict argument quality or that re-rank some initial retrieval result set. Overall, many more approaches could improve upon the argumentation-agnostic baselines (DirichletLM for Task 1 and BM25F for Task 2) than in the first year, indicating that progress was achieved. For a potential third year of the Touché lab, we currently plan to focus on retrieving the most relevant/argumentative text passages and on detecting the pro/con stance of the returned results. Acknowledgments We are very grateful to the CLEF 2021 organizers and the Touché participants, who allowed this lab to happen. We also want to thank Jan Heinrich Reimer for setting up Doccano, Christopher Akiki for providing the baseline DirichletLM implementation Swordsman, our volunteer anno- tators who helped to create the relevance and argument quality assessments, and our reviewers for their valuable feedback on the participants’ notebooks. This work was partially supported by the DFG through the project “ACQuA: Answering Comparative Questions with Arguments” (grants BI 1544/7-1 and HA 5851/2-1) as part of the priority program “RATIO: Robust Argumentation Machines” (SPP 1999). References [1] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument Retrieval, in: Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021), volume 12880 of Lecture Notes in Computer Science, Springer, 2021. [2] H. Wachsmuth, M. Potthast, K. A. Khatib, Y. Ajjour, J. Puschmann, J. Qu, J. Dorsch, V. Morari, J. Bevendorff, B. Stein, Building an Argument Search Engine for the Web, in: Proceedings of the 4th Workshop on Argument Mining (ArgMining@EMNLP 2017), Association for Computational Linguistics, 2017, pp. 49–59. URL: https://doi.org/10.18653/v1/w17-5106. [3] C. Stab, J. Daxenberger, C. Stahlhut, T. Miller, B. Schiller, C. 
Tauchmann, S. Eger, I. Gurevych, ArgumenText: Searching for Arguments in Heterogeneous Sources, in: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2018), Association for Computational Linguistics, 2018, pp. 21–25. URL: https://doi.org/10.18653/v1/n18-5005. [4] A. Chernodub, O. Oliynyk, P. Heidenreich, A. Bondarenko, M. Hagen, C. Biemann, A. Panchenko, TARGER: Neural Argument Mining at Your Fingertips, in: Pro- ceedings of the 57th Annual Meeting of the Association for Computational Linguis- tics (ACL 2019), Association for Computational Linguistics, 2019, pp. 195–200. URL: https://www.aclweb.org/anthology/P19-3031. [5] M. Schildwächter, A. Bondarenko, J. Zenker, M. Hagen, C. Biemann, A. Panchenko, An- swering Comparative Questions: Better than Ten-Blue-Links?, in: Proceedings of the Conference on Human Information Interaction and Retrieval (CHIIR 2019), Association for Computing Machinery, 2019, pp. 361–365. URL: https://doi.org/10.1145/3295750.3298916. [6] R. Bar-Haim, L. Eden, R. Friedman, Y. Kantor, D. Lahav, N. Slonim, From Arguments to Key Points: Towards Automatic Argument Summarization, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Association for Computational Linguistics, 2020, pp. 4029–4039. URL: https://doi.org/10.18653/v1/2020. acl-main.371. [7] R. Bar-Haim, D. Krieger, O. Toledo-Ronen, L. Edelstein, Y. Bilu, A. Halfon, Y. Katz, A. Menczel, R. Aharonov, N. Slonim, From Surrogacy to Adoption; From Bitcoin to Cryptocurrency: Debate Topic Expansion, in: Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL 2019), Association for Computational Linguistics, 2019, pp. 977–990. URL: https://doi.org/10.18653/v1/p19-1094. [8] Y. Mass, S. Shechtman, M. Mordechay, R. Hoory, O. S. Shalom, G. Lev, D. Konopnicki, Word Emphasis Prediction for Expressive Text to Speech, in: Proceedings of the 19th Annual Conference of the International Speech Communication Association (Interspeech 2018), ISCA, 2018, pp. 2868–2872. URL: https://doi.org/10.21437/Interspeech.2018-1159. [9] H. Wachsmuth, S. Syed, B. Stein, Retrieval of the Best Counterargument without Prior Topic Knowledge, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Association for Computational Linguistics, 2018, pp. 241–251. URL: https://www.aclweb.org/anthology/P18-1023/. [10] Y. Ajjour, H. Wachsmuth, J. Kiesel, M. Potthast, M. Hagen, B. Stein, Data Acquisi- tion for Argument Search: The args.me Corpus, in: Proceedings of the 42nd German Conference on Artificial Intelligence (KI 2019), Springer, 2019, pp. 48–59. doi:10.1007/ 978-3-030-30179-8\_4. [11] R. Levy, B. Bogin, S. Gretz, R. Aharonov, N. Slonim, Towards an Argumentative Con- tent Search Engine using Weak Supervision, in: Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), Association for Computational Linguistics, 2018, pp. 2066–2081. URL: https://www.aclweb.org/anthology/C18-1176/. [12] Aristotle, G. A. Kennedy, On Rhetoric: A Theory of Civic Discourse, Oxford: Oxford University Press, 2006. [13] H. Wachsmuth, N. Naderi, I. Habernal, Y. Hou, G. Hirst, I. Gurevych, B. Stein, Argu- mentation Quality Assessment: Theory vs. Practice, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Association for Computational Linguistics, 2017, pp. 250–255. 
URL: https://doi.org/10.18653/v1/P17-2039. [14] M. Potthast, L. Gienapp, F. Euchner, N. Heilenkötter, N. Weidmann, H. Wachsmuth, B. Stein, M. Hagen, Argument Search: Assessing Argument Relevance, in: Proceedings of the 42nd International Conference on Research and Development in Information Retrieval (SIGIR 2019), Association for Computing Machinery, 2019, pp. 1117–1120. URL: https: //doi.org/10.1145/3331184.3331327. [15] L. Gienapp, B. Stein, M. Hagen, M. Potthast, Efficient Pairwise Annotation of Argument Quality, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Association for Computational Linguistics, 2020, pp. 5772–5781. URL: https://www.aclweb.org/anthology/2020.acl-main.511/. [16] H. Wachsmuth, B. Stein, Y. Ajjour, "PageRank" for Argument Relevance, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), Association for Computational Linguistics, 2017, pp. 1117–1127. URL: https://doi.org/10.18653/v1/e17-1105. [17] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web., Technical Report 1999-66, Stanford InfoLab, 1999. URL: http://ilpubs. stanford.edu:8090/422/. [18] L. Dumani, P. J. Neumann, R. Schenkel, A Framework for Argument Retrieval - Ranking Argument Clusters by Frequency and Specificity, in: Proceedings of the 42nd European Conference on IR Research (ECIR 2020), volume 12035 of Lecture Notes in Computer Science, Springer, 2020, pp. 431–445. URL: https://doi.org/10.1007/978-3-030-45439-5_29. [19] L. Dumani, R. Schenkel, Quality Aware Ranking of Arguments, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM 2020), Association for Computing Machinery, 2020, pp. 335–344. URL: https://doi.org/10. 1007/978-3-030-45439-5_29. [20] H. Wachsmuth, N. Naderi, Y. Hou, Y. Bilu, V. Prabhakaran, T. A. Thijm, G. Hirst, B. Stein, Computational Argumentation Quality Assessment in Natural Language, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), 2017, pp. 176–187. URL: http://aclweb.org/anthology/E17-1017. [21] A. Nadamoto, K. Tanaka, A Comparative Web Browser (CWB) for Browsing and Comparing Web Pages, in: Proceedings of the 12th International World Wide Web Conference (WWW 2003), Association for Computing Machinery, 2003, pp. 727–735. URL: https: //doi.org/10.1145/775152.775254. [22] J. Sun, X. Wang, D. Shen, H. Zeng, Z. Chen, CWS: A Comparative Web Search System, in: Proceedings of the 15th International Conference on World Wide Web (WWW 2006), Association for Computing Machinery, 2006, pp. 467–476. URL: https://doi.org/10.1145/ 1135777.1135846. [23] N. Jindal, B. Liu, Identifying Comparative Sentences in Text Documents, in: Proceedings of the 29th Annual International Conference on Research and Development in Information Retrieval (SIGIR 2006), Association for Computing Machinery, 2006, pp. 244–251. URL: https://doi.org/10.1145/1148170.1148215. [24] N. Jindal, B. Liu, Mining Comparative Sentences and Relations, in: Proceedings of the 21st National Conference on Artificial Intelligence and the 18th Innovative Applications of Artificial Intelligence Conference (AAAI 2006), AAAI Press, 2006, pp. 1331–1336. URL: http://www.aaai.org/Library/AAAI/2006/aaai06-209.php. [25] W. Kessler, J. 
Kuhn, A Corpus of Comparisons in Product Reviews, in: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), European Language Resources Association (ELRA), 2014, pp. 2242–2248. URL: http://www. lrec-conf.org/proceedings/lrec2014/summaries/1001.html. [26] A. Panchenko, A. Bondarenko, M. Franzek, M. Hagen, C. Biemann, Categorizing Compara- tive Sentences, in: Proceedings of the 6th Workshop on Argument Mining (ArgMin- ing@ACL 2019), Association for Computational Linguistics, 2019, pp. 136–145. URL: https://doi.org/10.18653/v1/w19-4516. [27] N. Ma, S. Mazumder, H. Wang, B. Liu, Entity-Aware Dependency-Based Deep Graph Attention Network for Comparative Preference Classification, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Associa- tion for Computational Linguistics, 2020, pp. 5782–5788. URL: https://www.aclweb.org/ anthology/2020.acl-main.512/. [28] V. Chekalina, A. Bondarenko, C. Biemann, M. Beloucif, V. Logacheva, A. Panchenko, Which is Better for Deep Learning: Python or MATLAB? Answering Comparative Questions in Natural Language, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations (EACL 2021), Association for Computational Linguistics, 2021, pp. 302–311. URL: https://www.aclweb. org/anthology/2021.eacl-demos.36/. [29] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Association for Computational Linguistics, 2019, pp. 4171–4186. URL: https://doi.org/10.18653/v1/n19-1423. [30] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF, volume 41 of The Information Retrieval Series, Springer, 2019, pp. 123–160. URL: https://doi.org/10.1007/978-3-030-22948-1_5. [31] M. Fröbe, J. Bevendorff, L. Gienapp, M. Völske, B. Stein, M. Potthast, M. Hagen, CopyCat: Near-Duplicates within and between the ClueWeb and the Common Crawl, in: Proceedings of the 44th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2021), Association for Computing Machinery, 2021. URL: https://dl.acm. org/doi/10.1145/3404835.3463246. [32] M. Fröbe, J. Bevendorff, J. Reimer, M. Potthast, M. Hagen, Sampling Bias Due to Near- Duplicates in Learning to Rank, in: Proceedings of the 43rd International ACM Conference on Research and Development in Information Retrieval (SIGIR 2020), Association for Computing Machinery, 2020, pp. 1997–2000. URL: https://doi.org/10.1145/3397271.3401212. [33] M. Fröbe, J. Bittner, M. Potthast, M. Hagen, The Effect of Content-Equivalent Near- Duplicates on the Evaluation of Search Engines, in: Proceedings of the 42nd European Conference on IR Research (ECIR 2020), volume 12036 of Lecture Notes in Computer Science, Springer, 2020, pp. 12–19. doi:10.1007/978-3-030-45442-5\_2. [34] J. R. M. Palotti, H. Scells, G. Zuccon, TrecTools: an Open-source Python Library for Information Retrieval Practitioners Involved in TREC-like Campaigns, in: Proceedings of the 42nd International Conference on Research and Development in Information Retrieval (SIGIR 2019), Association for Computing Machinery, 2019, pp. 1325–1328. 
URL: https: //doi.org/10.1145/3331184.3331399. [35] H. Nakayama, T. Kubo, J. Kamura, Y. Taniguchi, X. Liang, Doccano: Text Annotation Tool for Human, 2018. URL: https://github.com/doccano/doccano, software available from https://github.com/doccano/doccano. [36] H. Wachsmuth, N. Naderi, I. Habernal, Y. Hou, G. Hirst, I. Gurevych, B. Stein, Argu- mentation Quality Assessment: Theory vs. Practice, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Association for Computational Linguistics, 2017, pp. 250–255. [37] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument Retrieval, in: Working Notes Papers of the CLEF 2020 Evaluation Labs, volume 2696 of CEUR Workshop Proceedings, 2020. URL: http://ceur-ws.org/Vol-2696/. [38] C. Fellbaum, WordNet: An Electronic Lexical Database, Bradford Books, 1998. [39] C. Akiki, M. Potthast, Exploring Argument Retrieval with Transformers, in: Working Notes Papers of the CLEF 2020 Evaluation Labs, volume 2696, 2020. URL: http://ceur-ws. org/Vol-2696/. [40] E. Raimondi, M. Alessio, N. Levorato, A Search Engine System for Touché Argument Retrieval task to answer Controversial Questions, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [41] E. Raimondi, M. Alessio, N. Levorato, Step approach to information retrieval., in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [42] C. Akiki, M. Fröbe, M. Hagen, M. Potthast, Learning to Rank Arguments with Feature Selection, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [43] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo- Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, R. Kurzweil, Universal Sentence Encoder, CoRR abs/1803.11175 (2018). URL: http://arxiv.org/abs/1803.11175. arXiv:1803.11175. [44] C. J. Burges, From RankNet to LambdaRank to LambdaMART: An Overview, Learning 11 (2010) 81. [45] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoy- anov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692. [46] M. Carnelos, L. Menotti, T. Porro, , G. Prando, Touché Task1: Argument Retrieval for Controversial Questions. Resolution provided by Team Goemon, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [47] L. Gienapp, Quality-aware Argument Retrieval with Topical Clustering, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [48] A. Mailach, D. Arnold, S. Eysoldt, S. Kleine, Exploring Document Expansion for Argument Retrieval, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [49] M. Alecci, T. B. amd Luca Martinelli, , E. Ziroldo, Development of an IR System for Argument Search, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [50] M. 
Lesk, Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone, in: V. DeBuys (Ed.), Proceedings of the 5th Annual International Conference on Systems Documentation, SIGDOC 1986, Toronto, Ontario, Canada, 1986, ACM, 1986, pp. 24–26. URL: https://doi.org/10.1145/318723.318728. doi:10.1145/318723.318728. [51] R. Agarwal, A. Koniaev, R. Schaefer, Exploring Argument Retrieval for Controversial Questions Using Retrieve and Re-rank Pipelines, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [52] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT- Networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Association for Computational Linguistics, 2019, pp. 3980–3990. URL: https://doi.org/10.18653/v1/D19-1410. [53] E. D. Togni, A. Frasson, G. Zanatta, Exploring Approaches for Touché Task 1, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [54] R. Krovetz, Viewing Morphology as an Inference Process, in: Proceedings of the 16th Annual International Conference on Research and Development in Information Retrieval (SIGIR 1993), Association for Computing Machinery, 1993, pp. 191–202. URL: https://doi. org/10.1145/160688.160718. [55] J. Lin, X. Ma, S. Lin, J. Yang, R. Pradeep, R. Nogueira, Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations, CoRR abs/2102.10073 (2021). URL: https://arxiv.org/abs/2102.10073. [56] F. Berno, A. Cassetta, A. Codogno, E. Vicentini, , A. Piva, Shanks Touché Homework, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [57] K. Ros, C. Edwards, H. Ji, C. Zhai, Argument Retrieval and Visualization, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [58] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, MS MARCO: A Human Generated MAchine Reading COmprehension Dataset, in: Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches Co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), volume 1773 of CEUR Workshop Proceedings, CEUR-WS.org, 2016. URL: http: //ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf. [59] C. Zhai, J. D. Lafferty, A study of smoothing methods for language models applied to ad hoc information retrieval, in: W. B. Croft, D. J. Harper, D. H. Kraft, J. Zobel (Eds.), SIGIR 2001: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, September 9-13, 2001, New Orleans, Louisiana, USA, ACM, 2001, pp. 334–342. doi:10.1145/383952.384019. [60] T. Green, L. Moroldo, A. Valente, Exploring BERT Synonyms and Quality Prediction for Argument Retrieval, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [61] M. Lippi, P. Torroni, MARGOT: A Web Server for Argumentation Mining, Expert Syst. Appl. 65 (2016) 292–303. URL: https://doi.org/10.1016/j.eswa.2016.08.050. [62] S. Iyer, N. Dandekar, K. 
Csernai, First Quora Dataset Release: Question Pairs, 2017. Re- trieved at https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs. [63] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl, in: Proceedings of the 40th European Conference on IR Research (ECIR 2018), volume 10772 of Lecture Notes in Computer Science, Springer, 2018, pp. 820–824. URL: https://doi.org/10.1007/978-3-319-76941-7_83. [64] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, in: Proceedings of the 1st International Conference on Learning Representa- tions (ICLR 2013), 2013. URL: http://arxiv.org/abs/1301.3781. [65] A. Trask, P. Michalak, J. Liu, sense2vec - A Fast and Accurate Method for Word Sense Disambiguation in Neural Word Embeddings, CoRR abs/1511.06388 (2015). URL: http: //arxiv.org/abs/1511.06388. arXiv:1511.06388. [66] T. Chen, C. Guestrin, XGBoost: A Scalable Tree Boosting System, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery, 2016, pp. 785–794. URL: https://doi.org/10.1145/ 2939672.2939785. [67] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T. Liu, LightGBM: A Highly Efficient Gradient Boosting Decision Tree, in: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS 2017), 2017, pp. 3146–3154. [68] J.-N. Weder, T. K. H. Luu, Argument Retrieval for Comparative Questions Based on Independent Features, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [69] V. Chekalina, A. Panchenko, Retrieving Comparative Arguments using Ensemble Methods and Neural Information Retrieval, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [70] L. Breiman, Random Forests, Mach. Learn. 45 (2001) 5–32. URL: https://doi.org/10.1023/A: 1010933404324. [71] S. MacAvaney, OpenNIR: A Complete Neural Ad-Hoc Ranking Pipeline, in: Proceedings of the 13th ACM International Conference on Web Search and Data Mining (WSDM 2020), Association for Computing Machinery, 2020, pp. 845–848. URL: https://doi.org/10.1145/ 3336191.3371864. [72] H. Hashemi, M. Aliannejadi, H. Zamani, W. B. Croft, ANTIQUE: A Non-factoid Question Answering Benchmark, in: Proceedings of the 42nd European Conference on IR Research (ECIR 2020), volume 12036 of Lecture Notes in Computer Science, Springer, 2020, pp. 166–173. URL: https://doi.org/10.1007/978-3-030-45442-5_21. [73] D. Helmrich, D. Streitmatter, F. Fuchs, M. Heykeroth, Touché Task 2: Comparative Argu- ment Retrieval. A Document-based Search Engine for Answering Comparative Questions, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [74] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language Models are Unsupervised Multitask Learners, OpenAI blog 1 (2019) 9. [75] G. V. Cormack, M. D. Smucker, C. L. A. Clarke, Efficient and Effective Spam Filtering and Re-ranking for Large Web Datasets, Inf. Retr. 14 (2011) 441–465. URL: https://doi.org/10. 1007/s10791-011-9162-z. [76] A. Alhamzeh, M. Bouhaouel, E. Egyed-Zsigmond, J. 
Mitrović, DistilBERT-based Argumen- tation Retrieval for Answering Comparative Questions, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. [77] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter, CoRR abs/1910.01108 (2019). URL: http://arxiv.org/abs/1910. 01108. arXiv:1910.01108. [78] E. Shirshakova, A. Wattar, Thor at Touché 2021: Argument Retrieval for Comparative Questions, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, 2021. A. Full Evaluation Results of Touché 2021: Argument Retrieval Table 5 Relevance results of all runs submitted to Task 1: Argument Retrieval for Controversial Questions. Reported are the mean nDCG@5 and the 95% confidence intervals. The two baseline rankings of the args.me search engine and DirichletLM are shown in bold. Team Run Tag nDCG@5 CI95 Low CI95 High Elrond ElrondKRun 0.720 0.651 0.785 Pippin Took seupd2021-[. . . ]-Dirichlet-mu-2000.0-topics-2021 0.705 0.634 0.772 Pippin Took seupd2021-[. . . ]-Dirichlet-mu-1500.0-topics-2021 0.702 0.626 0.767 Pippin Took seupd2021-[. . . ]-Dirichlet-mu-1800.0-topics-2021 0.701 0.632 0.770 Robin Hood robinhood_combined 0.691 0.628 0.752 Pippin Took seupd2021-[. . . ]-Dirichlet-mu-2000.0-expanded-[. . . ] 0.688 0.611 0.760 Pippin Took seupd2021-[. . . ]-Dirichlet-mu-1800.0-expanded-[. . . ] 0.683 0.606 0.760 Asterix run2021_Mixed_1.625_1.0_250 0.681 0.618 0.745 Dread Pirate Roberts dreadpirateroberts_lambdamart_small_features 0.678 0.605 0.743 Asterix run2021_Mixed_1.375_1.0_250 0.676 0.610 0.738 Elrond ElrondOpenNlpRun 0.674 0.600 0.746 Asterix run2021_Mixed_1.5_1.0_250 0.674 0.612 0.735 Elrond ElrondSimpleRun 0.674 0.610 0.735 Robin Hood robinhood_use 0.672 0.613 0.733 Skeletor bm25-0.7semantic 0.667 0.598 0.733 Skeletor manifold-c10 0.666 0.598 0.739 Skeletor manifold 0.666 0.587 0.737 Asterix run2021_Jolly_10.0_0.0_0.3_0.0__1.5_1.0_300 0.663 0.602 0.724 Luke Skywalker luke-skywalker 0.662 0.598 0.732 Skeletor bm25 0.661 0.581 0.732 Shanks re-rank2 0.658 0.593 0.720 Heimdall argrank_r1_c10.0_q5.0 0.648 0.580 0.715 Dread Pirate Roberts dreadpirateroberts_lambdamart_medium_features 0.647 0.580 0.720 Robin Hood robinhood_baseline 0.641 0.575 0.709 Heimdall argrank_r1_c10.0_q10.0 0.639 0.569 0.710 Shanks re-rank1 0.639 0.567 0.710 Shanks LMDSimilarity 0.639 0.570 0.709 Athos uh-t1-athos-lucenetfidf 0.637 0.568 0.705 Heimdall argrank_r1_c5.0_q10.0 0.637 0.565 0.702 Goemon Ishikawa goemon2021-dirichlet-lucenetoken-atirestop-nostem 0.635 0.561 0.704 Jean-Pierre Polnareff seupd-jpp-dirichlet 0.633 0.570 0.699 Goemon Ishikawa [. . . ]-dirichlet-opennlptoken-terrierstop-nostem 0.630 0.558 0.698 Swordsman Dirichlet_multi_field 0.626 0.559 0.698 Dread Pirate Roberts dreadpirateroberts_dirichlet_filtered 0.626 0.554 0.691 Goemon Ishikawa [. . . ]-dirichlet-lucenetoken-terrierstop-[. . . ]-queryexp 0.625 0.559 0.692 Yeagerists run_4_chocolate-sweep-50 0.625 0.551 0.693 Yeagerists run_2_lunar-sweep-201 0.624 0.547 0.698 Hua Mulan args_naiveexpansion_0 0.620 0.556 0.688 Hua Mulan args_gpt2expansion_0 0.620 0.550 0.686 Goemon Ishikawa [. . . 
]-dirichlet-lucenetoken-lucenestop-nostem 0.620 0.552 0.689 Elrond ElrondTaskBodyRun 0.614 0.544 0.680 Robin Hood robinhood_rm3 0.611 0.532 0.688 Macbeth macbethPretrainedBaseline 0.611 0.532 0.688 Yeagerists run_3_lunar-sweep-58 0.610 0.541 0.681 Yeagerists run_1_lucene_pure_rev 0.609 0.543 0.677 Macbeth macbethBM25CrossEncoder 0.608 0.527 0.687 Macbeth macbethBM25BiEncoderCrossEncoder 0.607 0.534 0.686 Swordsman args.me 0.607 0.528 0.676 Goemon Ishikawa [. . . ]-dirichlet-lucenetoken-lucenestop-[. . . ]-queryexp 0.607 0.539 0.679 Blade bladeGroupBM25Method1 0.601 0.533 0.673 Shanks multi-1 0.592 0.518 0.656 Shanks multi-2 0.590 0.520 0.662 Blade bladeGroupLMDirichlet 0.588 0.516 0.658 Dread Pirate Roberts dreadpirateroberts_run_mlm 0.577 0.505 0.654 Skeletor semantic 0.570 0.509 0.631 Asterix run2021_Baseline_BM25 0.566 0.510 0.624 Dread Pirate Roberts dreadpirateroberts_universal-sentence-encoder-qa 0.557 0.487 0.616 Deadpool uh-t1-deadpool 0.557 0.476 0.631 Macbeth macbethBM25AugmentedBiEncoderCrossEncoder 0.554 0.482 0.631 Yeagerists run_5_good-sweep-85 0.536 0.456 0.612 Blade bladeGroupBM25Method2 0.528 0.438 0.612 Batman DE_RE_Analyzer_4r100 0.528 0.461 0.599 Little Foot whoosh 0.521 0.442 0.596 Hua Mulan args_t5expansion_0 0.518 0.448 0.581 Macbeth macbethBiEncoderCrossEncoder 0.507 0.432 0.585 Gandalf BM25F-gandalf 0.486 0.416 0.553 Palpatine run 0.401 0.334 0.472 Batman ER_v1 0.397 0.309 0.486 Heimdall argrank_r0_c0.1_q5.0 0.004 0.000 0.013 Heimdall argrank_r0_c0.01_q5.0 0.000 0.000 0.000 Batman ER_Analyzer_5 0.000 0.000 0.000 Table 6 Quality results of all runs submitted to Task 1: Argument Retrieval for Controversial Questions. Re- ported are the mean nDCG@5 and the 95% confidence intervals. The two baseline rankings of the args.me search engine and DirichletLM are shown in bold. Team Run Tag nDCG@5 CI95 Low CI95 High Heimdall argrank_r1_c10.0_q10.0 0.841 0.802 0.876 Heimdall argrank_r1_c5.0_q10.0 0.839 0.803 0.875 Heimdall argrank_r1_c10.0_q5.0 0.833 0.797 0.869 Skeletor manifold 0.827 0.783 0.868 Skeletor bm25 0.822 0.784 0.861 Skeletor manifold-c10 0.818 0.778 0.856 Asterix run2021_Jolly_10.0_0.0_0.3_0.0__1.5_1.0_300 0.818 0.783 0.853 Elrond ElrondOpenNlpRun 0.817 0.777 0.856 Skeletor bm25-0.7semantic 0.815 0.774 0.852 Asterix run2021_Mixed_1.375_1.0_250 0.814 0.774 0.853 Pippin Took seupd2021-[. . . ]-Dirichlet-mu-1800.0-expanded-[. . . ] 0.814 0.773 0.852 Pippin Took seupd2021-[. . . ]-Dirichlet-mu-2000.0-expanded-[. . . ] 0.814 0.774 0.850 Goemon Ishikawa goemon2021-dirichlet-lucenetoken-atirestop-nostem 0.812 0.767 0.854 Hua Mulan args_gpt2expansion_0 0.811 0.773 0.849 Dread Pirate Roberts dreadpirateroberts_lambdamart_medium_features 0.810 0.769 0.849 Yeagerists run_4_chocolate-sweep-50 0.810 0.771 0.848 Elrond ElrondKRun 0.809 0.765 0.853 Yeagerists run_2_lunar-sweep-201 0.809 0.773 0.846 Robin Hood robinhood_baseline 0.809 0.770 0.844 Luke Skywalker luke-skywalker 0.808 0.767 0.850 Asterix run2021_Mixed_1.5_1.0_250 0.807 0.764 0.848 Yeagerists run_5_good-sweep-85 0.807 0.768 0.844 Goemon Ishikawa [. . . ]-dirichlet-lucenetoken-terrierstop-[. . . ]-queryexp 0.806 0.764 0.845 Dread Pirate Roberts dreadpirateroberts_lambdamart_small_features 0.804 0.765 0.844 Robin Hood robinhood_rm3 0.804 0.755 0.850 Macbeth macbethBM25CrossEncoder 0.803 0.762 0.840 Athos uh-t1-athos-lucenetfidf 0.802 0.758 0.844 Jean-Pierre Polnareff seupd-jpp-dirichlet 0.802 0.763 0.838 Asterix run2021_Mixed_1.625_1.0_250 0.802 0.758 0.843 Pippin Took seupd2021-[. . . 
]-Dirichlet-mu-1800.0-topics-2021 0.799 0.760 0.838 Yeagerists run_3_lunar-sweep-58 0.799 0.760 0.838 Yeagerists run_1_lucene_pure_rev 0.798 0.755 0.837 Pippin Took seupd2021-[. . . ]-Dirichlet-mu-2000.0-topics-2021 0.798 0.758 0.839 Goemon Ishikawa [. . . ]-dirichlet-opennlptoken-terrierstop-nostem 0.797 0.757 0.836 Pippin Took seupd2021-[. . . ]-Dirichlet-mu-1500.0-topics-2021 0.797 0.755 0.834 Goemon Ishikawa [. . . ]-dirichlet-lucenetoken-lucenestop-nostem 0.796 0.756 0.837 Swordsman Dirichlet_multi_field 0.796 0.759 0.837 Dread Pirate Roberts dreadpirateroberts_dirichlet_filtered 0.796 0.757 0.839 Goemon Ishikawa [. . . ]-dirichlet-lucenetoken-lucenestop-[. . . ]-queryexp 0.796 0.756 0.836 Shanks re-rank1 0.795 0.754 0.836 Shanks LMDSimilarity 0.795 0.757 0.835 Shanks re-rank2 0.790 0.750 0.826 Hua Mulan args_naiveexpansion_0 0.789 0.747 0.830 Elrond ElrondTaskBodyRun 0.788 0.742 0.830 Macbeth macbethPretrainedBaseline 0.783 0.738 0.824 Macbeth macbethBM25BiEncoderCrossEncoder 0.783 0.743 0.828 Dread Pirate Roberts dreadpirateroberts_run_mlm 0.779 0.737 0.820 Heimdall argrank_r0_c0.1_q5.0 0.767 0.725 0.811 Blade bladeGroupLMDirichlet 0.763 0.706 0.815 Robin Hood robinhood_combined 0.756 0.708 0.806 Macbeth macbethBM25AugmentedBiEncoderCrossEncoder 0.752 0.704 0.801 Blade bladeGroupBM25Method1 0.751 0.705 0.799 Macbeth macbethBiEncoderCrossEncoder 0.750 0.701 0.802 Heimdall argrank_r0_c0.01_q5.0 0.749 0.707 0.793 Elrond ElrondSimpleRun 0.740 0.693 0.785 Robin Hood robinhood_use 0.732 0.680 0.782 Little Foot whoosh 0.718 0.661 0.766 Swordsman args.me 0.717 0.663 0.773 Blade bladeGroupBM25Method2 0.705 0.639 0.766 Batman DE_RE_Analyzer_4r100 0.695 0.638 0.751 Shanks multi-2 0.684 0.627 0.739 Deadpool uh-t1-deadpool 0.679 0.618 0.738 Shanks multi-1 0.674 0.616 0.728 Skeletor semantic 0.671 0.602 0.737 Asterix run2021_Baseline_BM25 0.671 0.619 0.721 Batman ER_Analyzer_5 0.671 0.598 0.741 Batman ER_v1 0.662 0.589 0.721 Hua Mulan args_t5expansion_0 0.654 0.584 0.727 Dread Pirate Roberts dreadpirateroberts_universal-sentence-encoder-qa 0.624 0.558 0.681 Gandalf BM25F-gandalf 0.603 0.532 0.672 Palpatine run 0.562 0.497 0.633 Table 7 Relevance results of all runs submitted to Task 2: Argument Retrieval for Comparative Questions. Re- ported are the mean nDCG@5 and the 95% confidence intervals; ChatNoir baseline in bold. Team Run Tag nDCG@5 CI95 Low CI95 High Katana py_terrier_xgb 0.489 0.421 0.557 Thor uh-t2-thor 0.478 0.400 0.563 Rayla DistilBERT_argumentation_advanced_ranking_run_1 0.473 0.409 0.540 Rayla DistilBERT_argumentation_advanced_ranking_run_3 0.471 0.399 0.538 Jack Sparrow Jack Sparrow__bert 0.467 0.396 0.533 Rayla DistilBERT_argumentation_bm25 0.466 0.392 0.541 Katana lgbm_ranker 0.460 0.395 0.531 Rayla DistilBERT_argumentation_advanced_ranking_run_2 0.458 0.395 0.525 Mercutio ul-t2-mercutio-run_2 0.441 0.374 0.503 Jack Sparrow Jack Sparrow_ 0.422 0.357 0.489 Puss in Boots ChatNoir 0.422 0.354 0.490 Katana rand_forest 0.393 0.328 0.461 Katana run_tf.txt 0.385 0.320 0.456 Katana run.txt 0.377 0.311 0.445 Mercutio ul-t2-mercutio-run_1 0.372 0.306 0.438 Jack Sparrow Jack Sparrow__argumentative_bert 0.341 0.293 0.391 Jack Sparrow Jack Sparrow__argumentative 0.340 0.277 0.408 Mercutio ul-t2-mercutio-run_3 0.320 0.258 0.386 Prince Caspian prince-caspian 0.244 0.174 0.321 Katana bert_test 0.091 0.057 0.127 Table 8 Quality results of all runs submitted to Task 2: Argument Retrieval for Comparative Questions. 
Re- ported are the mean nDCG@5 and the 95% confidence intervals; ChatNoir baseline in bold. Team Run Tag nDCG@5 CI95 Low CI95 High Rayla DistilBERT_argumentation_bm25 0.688 0.614 0.758 Katana lgbm_ranker 0.684 0.624 0.749 Thor uh-t2-thor 0.680 0.606 0.760 Katana py_terrier_xgb 0.675 0.605 0.740 Rayla DistilBERT_argumentation_advanced_ranking_run_1 0.670 0.592 0.743 Jack Sparrow Jack Sparrow__bert 0.664 0.596 0.735 Jack Sparrow Jack Sparrow_ 0.652 0.582 0.718 Mercutio ul-t2-mercutio-run_2 0.651 0.577 0.728 Puss in Boots ChatNoir 0.636 0.559 0.713 Katana run_tf.txt 0.630 0.560 0.702 Rayla DistilBERT_argumentation_advanced_ranking_run_2 0.630 0.542 0.709 Katana rand_forest 0.628 0.558 0.691 Rayla DistilBERT_argumentation_advanced_ranking_run_3 0.625 0.548 0.696 Jack Sparrow Jack Sparrow__argumentative_bert 0.620 0.568 0.667 Mercutio ul-t2-mercutio-run_1 0.610 0.537 0.679 Katana run.txt 0.608 0.537 0.673 Jack Sparrow Jack Sparrow__argumentative 0.606 0.542 0.668 Prince Caspian prince-caspian 0.548 0.457 0.630 Mercutio ul-t2-mercutio-run_3 0.530 0.454 0.600 Katana bert_test 0.466 0.388 0.542