Overview of Touché 2022: Argument Retrieval
Extended Version*

Alexander Bondarenko¹, Maik Fröbe¹, Johannes Kiesel², Shahbaz Syed³, Timon Gurcke⁴, Meriem Beloucif⁵, Alexander Panchenko⁶, Chris Biemann⁷, Benno Stein², Henning Wachsmuth⁴, Martin Potthast³ and Matthias Hagen¹

¹ Martin-Luther-Universität Halle-Wittenberg
² Bauhaus-Universität Weimar
³ Leipzig University
⁴ Paderborn University
⁵ Uppsala University
⁶ Skolkovo Institute of Science and Technology
⁷ Universität Hamburg

touche@webis.de
https://touche.webis.de

Abstract
This paper is a report on the third year of the Touché lab on argument retrieval hosted at CLEF 2022. With the goal of supporting and promoting the research and development of new technologies for argument mining and argument analysis, we have organized three shared tasks: (a) argument retrieval for controversial topics, where the task is to find sentences that reflect the gist of arguments from online debates, (b) argument retrieval for comparative issues, where the task is to find argumentative passages from web documents that help in making a comparative decision, and (c) image retrieval for arguments, where the task is to find images that show support for or opposition to a particular stance.

Keywords
Argument retrieval, Controversial questions, Comparative questions, Image retrieval, Shared task

1. Introduction
Decision-making and opinion-forming are everyday tasks, often involving weighing pro and con arguments for or against different options. Considering the many arguments on almost any topic on the web, in principle anyone can come to an informed decision or opinion with the help of a search engine. However, large parts of the easily accessible arguments on the web are of low quality. They may contain incoherent logic, fail to substantiate a claim, or use inappropriate language.
These arguments should not appear at the top of search results, regardless of whether a query is about socially important issues or “only” personal choices. Challenges arising from this observation range from evaluating the relevance of an argument to a query and assessing how well an implied stance is justified, to identifying the gist of an argument, to finding images that illustrate a particular stance. Commercial web search engines do not sufficiently address these challenges, a gap we aim to fill with the Touché labs.

* This overview extends the one published as part of the CLEF 2022 proceedings [1].
‘Touché’ is commonly “used to acknowledge a hit in fencing or the success or appropriateness of an argument” (merriam-webster.com).
CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Following the two successful Touché labs on argumentation at CLEF 2020 and 2021 [2, 3], our third lab edition again brought together researchers from the fields of information retrieval and natural language processing who study argumentation. At Touché 2022, we have organized the following three shared tasks, the last of which is a completely new addition:

1. Argumentative sentence retrieval from a focused collection (crawled from debate portals) to support conversations about controversial topics.
2. Argument retrieval from a large collection of text passages to support answering comparative questions in personal decision making.
3. Argumentative image retrieval to support the illustration of arguments and getting an overview of the public opinion on controversial topics.
Touché follows the traditional TREC methodology: documents and topics are provided to participants, who then submit their results (up to five runs) for each topic to be assessed by human assessors. While the first two Touché editions focused on full argument and document retrieval, the third edition focused on more fine-grained retrieval units. The three shared tasks investigated whether argument retrieval can more directly support decision making and opinion formation by extracting the gist of documents, classifying their stance on an issue as pro or con, and retrieving images that support or oppose a particular stance.

The teams that participated in the third Touché lab were able to use the topics and assessments (relevance and quality of arguments) from the previous lab editions to train and optimize their approaches. In addition to traditional retrieval models such as BM25 [4], re-ranking approaches such as the recent transformer-based models T5 [5] and T0 [6] have been applied with the goal of combining topical relevance with “argumentativeness,” argument quality, or stance. They are an essential part of the most effective approaches of all three Touché tasks, confirming the general trend in information retrieval and natural language processing that pre-trained transformers achieve good effectiveness [7] (cf. Sections 4–6). The most effective approach submitted to Task 1 re-ranks the DirichletLM model’s search results by first using a BERT-based classifier [8] to decide on the argumentativeness of retrieved sentence pairs (i.e., whether they are premises or assertions), then estimating their coherence using the cosine similarity of their BERT embeddings. For Task 2, in terms of relevance, a TCT-ColBERT ranker [9] and, in terms of quality, a combination of query-dependent BM25F scores [10] and predicted argument quality were most effective.
The most effective approach for Task 3 (across topic relevance, argumentativeness, and stance relevance) used BERT instead of a stance detection model to detect the sentiment of texts from web pages and texts in images and indexed both with BM25F. Altogether, the most effective argument retrieval approaches used various strategies for query reformulation and expansion, and for re-ranking based on estimates of argument quality or “argumentativeness”. Sentiment or emotion recognition was particularly useful for the argumentative image retrieval task, as well as OCR to retrieve image text for analysis.

The corpora, topics, and judgments created at Touché are freely available to the research community and can be found on the lab’s website.¹ Parts of the data are also already available via the BEIR [11] and ir_datasets [12] resources.

¹ https://webis.de/events.html?q=Touche#shared-tasks

2. Related Work
Queries in argument retrieval often are phrases that describe a controversial topic, questions that ask to compare two options, or even complete claims or short arguments [13]. In the third edition of the Touché lab, we address the first two query types in three different shared tasks on argument retrieval in general, on comparative scenarios, and on image retrieval. Here, we briefly summarize the related work for all three tasks.

2.1. Argument Retrieval
The goal of argument retrieval is to find arguments that help when making a decision, when forming an opinion, or when trying to convince (or persuade) someone of a specific point of view. An argument is usually modeled as a conclusion with one or more supporting or attacking premises [14]. While a conclusion is a statement that can be accepted or rejected, a premise is a more grounded statement (e.g., statistical evidence or a referenced quote).
Adding argument retrieval components to a search engine poses challenges like identifying argumentative queries [15], mining arguments from documents, or assessing an argument’s relevance and quality [14]. Different paradigms have been proposed for actual argument retrieval that perform argument mining and ranking in different order [16]. For instance, Wachsmuth et al. [14] use distant supervision and extract and index arguments from debate portals in a “pre-processing” step. Their argument search engine args.me² uses BM25F [10] to then only rank the extracted arguments at query time, giving more weight to conclusions than premises. Also Levy et al. [17] use distant supervision to mine arguments from Wikipedia in an offline pre-processing before ranking. Following a different paradigm, Stab et al. [18] retrieve documents from the Common Crawl³ at query time (no prior offline argument mining) and use a topic-dependent neural network to then extract arguments from the retrieved documents. In our Touché tasks, we address both paradigms, the one of Wachsmuth et al. [14] in Task 1 (retrieval from a focused collection of pre-processed arguments) and the one of Stab et al. [18] in Task 2 (retrieval from some general collection with online argument mining).

Argument retrieval should take topical relevance into account but also argument quality. What makes a good argument has been studied since the time of Aristotle [19]. Wachsmuth et al. [20] categorize the different aspects of argument quality into a taxonomy that covers three dimensions: logic, rhetoric, and dialectic. Logic concerns the strength of the internal structure of an argument (i.e., the conclusion and the premises along with their relations) while rhetoric covers the effectiveness of an argument in persuading an audience with its conclusion. Lastly, dialectic addresses the relations of an argument to other arguments on the topic. For example, an argument attacked by many others may be rather vulnerable in a debate.
Note that an argument’s relevance to a query is also categorized under dialectical quality [20]. Argument relevance has been typically assessed by an argument’s similarity to a given topic and by incorporating the support and attack relations to other arguments. Potthast et al. [21] evaluate four standard retrieval models for ranking arguments with regard to topical relevance, logic, rhetoric, and dialectic. One of the main findings is that DirichletLM is better at ranking arguments than BM25, DPH, and TF-IDF. Gienapp et al. [22] later proposed a pairwise annotation strategy that reduces the costs of crowdsourcing argument retrieval annotations by 93% (i.e., requiring the annotation of only a rather small subset of argument pairs).

² https://www.args.me/
³ http://commoncrawl.org

As for argument ranking, several approaches exploit argument relations. For instance, Wachsmuth et al. [23] connect two arguments in a graph when one uses the other’s conclusion as a premise and then compute an argument’s PageRank [24] in this graph. In their study, taking PageRank into account improves upon baselines that only use an argument’s content and internal structure (conclusion and premises) [23]. Later, Dumani et al. [25] used support and attack relations between clusters of premises and claims as well as between clusters of claims and a query. In an extended version, Dumani and Schenkel [26] also include the quality of a premise as a probability (fraction of premises that are worse with regard to cogency, reasonableness, and effectiveness). Using a pairwise quality estimator trained on the Dagstuhl-15512 ArgQuality Corpus [27], the approach with the argument quality component was more effective on the 50 topics of Task 1 from Touché 2020 than the one without taking argument quality into account.

2.2. Retrieval for Comparisons
Comparative information needs in web search have first been addressed with basic interfaces for comparing two products entered separately in two search boxes [28, 29]. Using opinion mining approaches, comparative sentences can then be identified from product reviews in favor of or against one or the other product [30, 31, 32]. Recently, identifying a comparison preference in a sentence (i.e., the “winning” option) has also been tackled more broadly (not just for product reviews) [33, 34] and forms the basis of the comparative argumentation machine CAM [35]. Similar to the early comparison interfaces, CAM takes two objects and some comparison aspect(s) as input, retrieves comparative sentences in favor of one or the other option using BM25, and then classifies the sentences’ preferences for a final merged table-like result presentation. A proper argument ranking, however, was not included in CAM. Chekalina et al. [36] later extended the system to accept complete comparative questions as input and to return a natural language answer. From a comparative question, the comparison objects, aspect(s), and predicates are extracted and the system’s answer is either generated directly based on transformers [8] or by retrieval from an index of comparative sentences.

To identify comparative questions and information needs, Bondarenko et al. [37, 38] propose a cascading ensemble of classifiers (rule-based, feature-based, and neural models). They also propose improved approaches to extract the comparison objects, aspects, and predicates from comparative questions and to detect the stance of potential answers towards the comparison objects. The respective stance dataset could also be used by the participants of our Task 2.

2.3. Image Retrieval
Images can provide contextual information and express, underline, or popularize an opinion [39], thereby taking the form of subjective statements [40].
While some images can be complete arguments (i.e., expressing both a premise and a conclusion) [41, 42], others provide contextual information only and have to be combined with a textual conclusion to form an argument. A recent SemEval task distinguished a total of 22 persuasion techniques in memes alone [43]. Moreover, argument quality dimensions like acceptability, credibility, emotional appeal, and sufficiency [27] all also apply to arguments that include images.

Pre-dated only by approaches relying on metadata and similarity measures [44], the actual content of images or videos has been analyzed and used for keyword-based image search for decades [45]. In a recent survey, Latif et al. [46] categorize image features into color, texture, shape, and spatial features, but commercial search engines also index text found in images, surrounding text, alternative texts displayed when an image is unavailable, and the image URLs [47, 48]. As for the retrieval of argumentative images, a closely related concept is “emotional images”, which is based on image features like color and composition [49, 50]. Since argumentation often goes hand in hand with emotions, emotional features may also be promising for retrieving images for arguments, a relatively new task recently proposed by Kiesel et al. [51] and now forming Task 3 of the Touché 2022 lab.

3. Lab Overview and Statistics
For the third edition of the Touché lab, we received 58 registrations, doubling the number from the previous year (29 registrations in 2021). Among the teams, 27 registered for more than one task, 17 registered particularly for Task 1, 10 for Task 2, and 4 for Task 3 (the new task this year). The majority of registrations came from Germany and Italy (13 each), followed by India (12), the United States (3), the Netherlands, France, Switzerland, Bangladesh (2 each), Pakistan, Portugal, United Kingdom, Indonesia, China, Russian Federation, Bulgaria, Nigeria, and Lebanon (1 each).
Aligned with the lab’s fencing-related title, the registered teams selected a real or fictional fencer or swordsman character (e.g., D’Artagnan) as their team name. From the 58 registered teams, 23 actively participated in the tasks and submitted results⁴ (27 teams submitted in 2021 and 17 teams in 2020). Using the setup of the previous Touché editions, we encouraged the teams to deploy their software in TIRA [52] for better reproducibility of the developed approaches. The TIRA integrated research architecture is a cloud-based evaluation-as-a-service platform where shared task participants can deploy their software in a dedicated virtual machine to which they have full administrative access. By default, the virtual machines run the server version of Ubuntu 20.04 with one Intel Xeon E5-2620 CPU, 4 GB RAM, 16 GB HDD, and the latest versions of often-used software packages pre-installed (e.g., Docker and Python). If needed, we tried to customize the resources as per a team’s requirements. Providing GPUs was not possible, though.

⁴ Three teams did not submit a paper describing their approach, though.

For teams that did not deploy their software in TIRA, we allowed run submissions similar to many TREC tracks. In case they preferred software submissions, the teams created their runs via the web UI of TIRA by remote-executing their software inside their virtual machine. The software is fully installed in the virtual machine, and at execution time the virtual machine is shut down, disconnected from the internet, powered on again in a sandbox mode, and the test datasets for the respective tasks are mounted. Interrupting the internet connection ensures that the participants’ software works without external web services that may disappear or become incompatible, which could reduce reproducibility (i.e., downloading additional external code or models during the execution is not possible). We offered support in case of problems
during deployment and then archived the virtual machines that the participants used for their submissions. The respective systems can thus be re-evaluated or also applied to new datasets with the same input format.

Overall, 9 of the 23 teams submitted traditional runs instead of deploying their software in TIRA. Per team, we allowed 5 runs, and each run needed to follow the standard TREC format.⁵ We checked the validity of each submitted run and asked participants to rerun their software or resubmit their files in case of problems while also offering support in such cases. In total, 84 runs were submitted (at least one from each team).

4. Task 1: Argument Retrieval for Controversial Questions
The goal of the Touché 2022 lab’s first task was to support individuals who search for opinions and arguments on socially important controversial topics like “Are social networking sites good for our society?”. Such scenarios benefit from obtaining the gists of various web resources that briefly summarize different stances (pro or con) on controversial topics. The task we considered in this regard followed the idea of extractive argument summarization [53].

4.1. Task Definition and Data
Task. Given a controversial topic and a collection of arguments, the task was to retrieve sentence pairs that represent the gist of their corresponding arguments (e.g., the main claim and a supporting premise). Sentences in such a pair must not contradict each other and ideally build upon each other in a logical manner, forming a coherent text.

Topics. We used 50 controversial topics from the previous iterations of Touché. Each topic is formulated as a question that the user might pose as a query to the search engine, accompanied by a description summarizing the information need and the search scenario, along with a narrative to guide assessors in recognizing relevant results (see Table 1).

Document collection.
The document collection for Task 1 was based on the args.me corpus [16] which contains about 400,000 structured arguments (crawled from the online debate portals debatewise.org, idebate.org, debatepedia.org, and debate.org). It is freely available for download⁶ and can also be accessed through the args.me API.⁷ To account for this year’s changes in the task definition (the focus on gists), we prepared a pre-processed version of the corpus. Preprocessing steps included sentence splitting and removing premises and conclusions shorter than two words, resulting in 5,690,642 unique sentences with 64,633 claims and 5,626,509 premises.

⁵ The expected format was also described at the lab’s web page: https://webis.de/events/touche-22/
⁶ https://webis.de/data.html#args-me-corpus
⁷ https://www.args.me/api-en.html

Table 1
Example topic for Task 1: Argument Retrieval for Controversial Questions.

Number       34
Title        Are social networking sites good for our society?
Description  Democracy may be in the process of being disrupted by social media, with the potential creation of individual filter bubbles. So a user wonders if social networking sites should be allowed, regulated, or even banned.
Narrative    Highly relevant arguments discuss social networking in general or particular networking sites, and its/their positive or negative effects on society. Relevant arguments discuss how social networking affects people, without explicit reference to society.

4.2. Evaluation Setup
Participants submitted their rankings as traditional TREC-style runs where document IDs are sorted by descending relevance score for each search topic (i.e., the most relevant argument occurs at Rank 1).
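As an illustration of such a TREC-style run, the following minimal sketch writes one line per retrieved item in the commonly used six-column layout (topic ID, the literal `Q0`, document ID, rank, score, run tag). This layout is an assumption here; the exact fields expected by the lab are those described on the lab's web page. The function name `write_trec_run`, the pair IDs, and the tag `myTeam-bm25` are hypothetical.

```python
from typing import Dict, List, Tuple


def write_trec_run(rankings: Dict[str, List[Tuple[str, float]]], tag: str) -> List[str]:
    """Format per-topic (doc_id, score) lists as TREC-style run lines."""
    lines = []
    for topic_id, scored in rankings.items():
        # Sort by descending score so that the most relevant item gets rank 1.
        ordered = sorted(scored, key=lambda item: item[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ordered, start=1):
            # Assumed layout: topic, literal "Q0", document ID, rank, score, run tag.
            lines.append(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {tag}")
    return lines


# Hypothetical scores for two sentence pairs retrieved for topic 34.
example = {"34": [("pair-b", 11.2), ("pair-a", 17.9)]}
run_lines = write_trec_run(example, tag="myTeam-bm25")
print("\n".join(run_lines))
```

Sorting inside the writer keeps rank and score consistent even when the retrieval component returns results in arbitrary order.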
Given the large number of runs and the possibility of retrieving up to 1000 documents (in our case, these are sentence pairs) per topic in a run, we created the pools with TrecTools [54] using a top-5 pooling strategy, resulting in 6,930 unique sentence pairs for manual assessment of relevance, quality (argumentativeness), and textual coherence. Relevance was judged by our volunteer assessors on a three-point scale: 0 (not relevant), 1 (relevant), and 2 (highly relevant). For quality, annotators assessed whether a retrieved pair of sentences is rhetorically well written on a three-point scale: 0 (low quality/non-argumentative), 1 (average quality), and 2 (high quality). Textual coherence (if the two sentences in a pair logically build upon each other) was also judged on a three-point scale: 0 (unrelated/contradicting), 1 (average coherence), and 2 (high coherence).

4.3. Submitted Approaches and Evaluation Results
This year’s approaches included standard retrieval models such as TF-IDF, BM25, DirichletLM, and DPH. Participants also used third-party toolkits, such as the Project Debater API [55] (for stance and evidence detection in arguments), Apache OpenNLP⁸ (for language detection), and BERT-based classifiers proposed by Reimers et al. [56] trained on the Webis Argument Quality Corpus [22] and the IBM Rank 30K dataset [57] for argument quality detection. Additionally, semantic similarity of word and sentence embeddings based on doc2vec [58], Spacy embeddings [59], and SBERT [60] have been employed for retrieving coherent sentence pairs as required by the task definition. One team leveraged the text generation capabilities of GPT-2 [61] to find subsequent sentences while another team similarly used the next sentence prediction (NSP) of BERT [8] for this. These toolkits augmented the document preprocessing and re-ranking of the retrieved results.

⁸ https://opennlp.apache.org/

Table 2
Results of Task 1 (Argument Retrieval for Controversial Questions). Shown are the nDCG@5 scores of a team’s best run for the three dimensions relevance, quality, and coherence of the retrieved sentence pairs, along with the run’s rank (results of all submitted runs in Tables 6–8). The teams are ordered alphabetically; the baseline Swordsman is marked. A † indicates statistically significant differences to the baseline (paired Student’s t-test, p = 0.05, Bonferroni correction).

Team                  Relevance        Quality          Coherence
                      Rank  nDCG@5     Rank  nDCG@5     Rank  nDCG@5
Bruce Banner          3     0.651†     5     0.772†     4     0.378
D’Artagnan            4     0.642†     7     0.733†     5     0.378†
Daario Naharis        2     0.683†     1     0.913†     1     0.458†
Gamora                5     0.616†     3     0.785†     7     0.285
General Grevious      9     0.403      10    0.517      10    0.231
Gorgon                8     0.408      6     0.742†     8     0.282
Hit Girl              6     0.588†     4     0.776†     6     0.377
Korg                  11    0.252      11    0.453†     11    0.168
Pearl                 7     0.481      8     0.678      3     0.398†
Porthos               1     0.742†     2     0.873†     2     0.429†
Swordsman (baseline)  10    0.356      9     0.608      9     0.248

We used nDCG@5 to evaluate relevance, quality, and coherence. Table 2 shows the results of the best run per team. On all the evaluated dimensions, at least eight out of ten teams managed to beat the provided baseline. Similar to previous years’ results, quality is best covered by the approaches, followed by relevance and the newly added coherence dimension. Summarizing the results, for relevance, Team Porthos [62] achieved the highest rank followed by Daario Naharis [63] with nDCG@5 scores of 0.742 and 0.683, respectively. For the quality and coherence dimensions, Daario Naharis obtained the highest scores (0.913 and 0.458) followed by Porthos (0.873 and 0.429). We believe that the two-stage re-ranking employed by Daario Naharis improved coherence and quality in comparison to the other approaches. They first ensured that retrieved pairs were relevant to their context in the argument alongside the topic, which preserved high-quality arguments. Then, a second re-ranking based on stance to determine the final pairing of the retrieved sentences boosted coherence.
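To make the nDCG@5 measure concrete, the following is a minimal sketch under stated assumptions: graded labels 0, 1, and 2 are used directly as gains, the discount is log2(rank + 1), and unjudged items count as 0. The exact nDCG variant used by the evaluation tooling is not specified in this section, so these choices and the function names `dcg_at_k` and `ndcg_at_k` are illustrative only.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (linear gain, log2 discount)."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains: Sequence[float], judged_gains: Sequence[float], k: int = 5) -> float:
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    if ideal == 0:
        return 0.0  # no item with a positive label was judged for this topic
    return dcg_at_k(ranked_gains, k) / ideal


# Hypothetical topic: labels of the five retrieved pairs in ranked order,
# and the labels of all judged pairs for that topic.
ranked = [2, 0, 1, 0, 0]
judged = [2, 1, 0, 0, 0, 0]
score = ndcg_at_k(ranked, judged, k=5)
print(round(score, 3))
```

A per-team score as in Table 2 would then be the mean of such per-topic values over all topics, computed separately for the relevance, quality, and coherence labels.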
Below, we briefly describe our baseline and summarize the submitted approaches.

Our baseline Swordsman employed a graph-based approach that ranks arguments’ sentences by their centrality in the corresponding argument graph as proposed by Alshomary et al. [53]. The top two sentences per argument are used as their gist. We retrieved 1000 pairs per topic.

Bruce Banner [64] employed the BM25 retrieval model implemented in the Pyserini toolkit [65] with its default parameters (k1 = 1.2 and b = 0.68). For each argument, they indexed all possible sentence pairs. To speed up computation on such a large collection of sentence pairs, they specifically opted for the sparse representations in Pyserini that produce smaller indexes compared to the dense retrieval variants. Two query variants were used: original query (topic title) and an expanded query (narrative and description appended). Likewise, two variants of the sentence pairs were indexed: original pair and pair with the topic of a debate appended. They retrieved 1000 documents per query and did not apply any re-ranking.

D’Artagnan [66] also employed sparse retrieval together with text preprocessing and query expansion. For retrieval, they used two retrieval models from Lucene: BM25 [10] (k1 = 1.2 and b = 0.75) and DirichletLM (μ = 2000). For preprocessing, they experimented with both Porter [67] and Krovetz [68] stemmers. Additionally, they filtered both character and word n-grams (referred to as shingles) and used two stop word lists (SMART System [69], Glasgow IR⁹). Query expansion was done using synonyms from WordNet [70] and word2vec [71]. Evaluation on the previous year’s relevance judgments showed that a combination of the DirichletLM retrieval model, the Krovetz stemmer, and the Glasgow IR stop word list improved performance compared to their respective counterparts.

Daario Naharis [63] developed a standard Lucene-based document retrieval system using the TF-IDF model.
Additionally, they introduced a new measure called ICoefficient for scoring the discriminant power of a term. This complements the standard TF-IDF weighting by additionally considering the number of documents that contain at least one occurrence of a given term. We refer readers to Bahrami et al. [63] for the mathematical formulation of the ICoefficient. For preprocessing, they created two custom stop lists, each composed of the 100 most frequent terms in the indexed collections of the argument contexts and individual arguments from the provided corpus. Document re-ranking was performed based on stance and evidence detection using the Project Debater API [55].

Gamora [72] developed Lucene-based approaches using deduplication and contextual feature-enriched indexing, adding the topic of a debate and the stance on the topic, to obtain document-level relevance and quality scores, following the approaches used in previous Touché editions [3]. To find relevant sentence pairs rather than relevant documents, these results were used to limit the number of documents by creating a new index for only the sentences of relevant documents (double indexing) or creating all possible sentence combinations and ranking them based on a weighted average of the argument quality (estimated using an SVM classifier) of the pair and its source document. BM25 [65] (k1 = 1.2 and b = 0.75) and DirichletLM (μ = 2000) were used for document similarity and SBERT [60] and TF-IDF for sentence similarity. The best approach is based on double indexing and a combination of a manual query reduction in which only the 2–6 main words of the query were kept, query boosting, query decorators, query expansion with respect to important keywords (GloVe [73]) and synonyms (WordNet [70]), as well as possessive removal, stemming (Krovetz stemmer [68]), and length filtering of the sentences.

General Grevious [74] used a conventional IR pipeline based on Lucene.
First, documents were lowercased and tokenized, and possessive words (with trailing ’s) were removed, keeping only tokens with a length between 3 and 20 characters. In addition, the team experimented with a variety of stemming approaches (S-stemmer [75], Krovetz stemmer [68], Porter stemmer [67], no stemming) and stop word lists (Core NLP [76], CountWordsFree [65], EBSCO,¹⁰ GoogleStop,¹¹ and Ranks¹²). To retrieve documents, BM25 [65] (k1 = 1.2 and b = 0.75) and DirichletLM (μ ∈ {1700, 1800}) were used together with query boosting, by assigning weights to the used inputs (argument, conclusion, debate title, and argument title), and query expansion, by finding keywords (Rapid Automatic Keyword Extraction (RAKE) [77]) and synonyms (Datamuse¹³). This retrieval step was done once for the documents and once for all the potential sentence pairs within these retrieved documents to obtain a ranking of sentence pairs. Finally, sentiment analysis (Vader [78]) was used to boost documents that have a similar sentiment as the query, and readability analysis (Flesch-Kincaid [79]) was used for re-ranking. Their best model does not include re-ranking, stemming, and stop word removal but relies solely on the combination of query expansion and the BM25 retrieval model.

⁹ https://github.com/igorbrigadir/stopwords/
¹⁰ https://connect.ebsco.com/s/article/What-are-stop-words-and-how-does-EBSCO-s-search-engine-handle-them?
¹¹ https://www.semrush.com/blog/seo-stop-words/
¹² https://www.ranks.nl/stopwords

Gorgon [80] also used a Lucene-based IR pipeline and compared BM25 [65] (k1 = 1.2 and b = 0.75) and DirichletLM (μ = 2000) similarity measures, developing four different analyzers with different preprocessing steps including lowercasing, stemming (Krovetz stemmer [68]), removing possessive words (with trailing ’s), and filtering stop words (99webtools,¹⁴ EBSCO). Sentence pairs were created from all combinations within a single document before indexing.
The best approach is a combination of lowercasing, removing possessive words, and BM25.

Hit Girl [81] proposed a two-stage retrieval pipeline that combines semantic search and re-ranking via argument quality agnostic models. Documents were embedded to vectors using Spacy [59]. These were then indexed via Elasticsearch, and its text similarity function was used for semantic search. They experimented with three approaches for re-ranking: maximal marginal relevance [82], word mover’s distance [83], and a novel method called structural distance which employs fuzzy matching between query and sentences based on POS tags. Preliminary evaluations showed that, while re-ranking improved the argument quality to varying degrees, it also affected relevance. Also, structural distance performed best for re-ranking.

Korg’s [84] approaches are based on the Elasticsearch implementation of DirichletLM (μ = 2000) to find the best matching argumentative sentences for a query after employing lowercasing, ASCII folding, stop word filtering (manually created stop word list), and stemming (Krovetz stemmer [68]). Then, either doc2vec [58] or SBERT [60] was trained on all sentences in the args.me corpus and used to find the most similar sentence pair within a document by direct comparison of the doc2vec embeddings. Alternatively, instead of directly comparing sentences, GPT-2 [61] was used to generate the next sentence for a given sentence to then find the most similar sentence to the generated sentence. The best approach is based on lowercasing, ASCII folding, stop word filtering, stemming, and doc2vec’s similarity calculation without GPT-2.

Pearl [85] also proposed a two-stage retrieval pipeline using DirichletLM [86] and DPH [87] models to retrieve argumentative sentences. For both stages, they used the PyTerrier toolkit [88].
After retrieving the documents, two BERT-based argument quality models, fine-tuned on the Webis Argument Quality Corpus [89] and the IBM-Rank-30k dataset [57], were employed to filter non-argumentative results. The resulting prototype from the first stage was considered the baseline model. Evaluating it on a set of 35 queries taken from the provided topics, they found that the DPH model assigned high relevance to sentences even if their terms are part of a URL or other metadata in the corpus. Moreover, it was susceptible to homonyms, which negatively affected retrieval performance. To account for this, a refined prototype was developed that combined argument quality prediction with query expansion. For query expansion, they applied the Bo1 query expansion algorithm provided by PyTerrier, which weighs the terms based on divergence from randomness built on Bose-Einstein statistics [90]. Specifically, the Bo1 model extracts terms from the top-ranked documents retrieved for the original query, weighs them based on their informativeness, and appends the highest-weighted terms to the original query to expand it. Finally, a custom block list consisting of commonly repeated phrases such as “my opponent claims...”, “PRO claims...”, and “I accept this debate” filtered out further noisy sentences, leading to improved nDCG scores.

13 https://www.datamuse.com/api/
14 https://99webtools.com/blog/list-of-english-stop-words/

Porthos [62] used the Elasticsearch implementation of DirichletLM (with 𝜇 = 116 being the average length of sentences in the corpus) and BM25 [65] (default Elasticsearch implementation with 𝑘1 = 1.2 and 𝑏 = 0.75) for retrieval after removing sentence duplicates and filtering out sentences with low-quality language, retaining only those that contain at least one verb. Another filtering step is based on the argumentativeness of sentences using the support vector machine (SVM) of [22] and the BERT approach of [56].
In addition, sentences were stemmed and lowercased, and stop words were removed. The approaches compose the search query from single terms and Boolean queries and use the approach of Reimers et al. [56] to reorder the retrieved sentences according to their argumentative quality. The sentences are paired with SBERT [60] and BERT [8] trained on Next Sentence Prediction (NSP). The best approach combines DirichletLM, NSP, the sentence classifier in preprocessing, Boolean queries with noun chunking for retrieval, and the BERT approach of [56] for re-ranking.

5. Task 2: Argument Retrieval for Comparative Questions

The goal of the Touché 2022 lab’s second task was to support informed decisions in “everyday” or personal comparison situations, for instance for a question like “Should I major in philosophy or psychology?”. Decision making in such situations benefits from finding balanced reasons for choosing one option over the other, usually in the form of opinions or arguments.

5.1. Task Definition and Data

Task. Given a collection of text passages and a comparative topic with two comparison objects, the task was to retrieve relevant argumentative passages for or against one or both objects, and to detect the passages’ stances with respect to the objects.

Topics. We provided 50 topics that describe scenarios of personal decision making. Each topic has a title formulated as a comparative question, a pair of comparison objects from the title that could be used for the stance detection of the retrieved passages, a description with some background on the particular search scenario, and a narrative that served as a guideline for our assessors (cf. Table 3 for an example).

Document collection. The retrieval collection for Task 2 was a corpus of 868,655 passages extracted from ClueWeb12.15 We constructed this passage corpus using all 37,248 documents from the top-100 pool of all runs submitted to Task 2 in the previous Touché editions.
Using the TREC CAsT tools,16 we split the documents at sentence boundaries into fixed-length passages of approximately 250 terms, since ranking fixed-length passages was shown to be more effective than that of variable-length passages [91]. From the initial 1,286,977 passages, we removed near-duplicates with CopyCat [92] to mitigate unwanted side-effects of near-duplicates on retrieval effectiveness [93, 94], resulting in the final collection of 868,655 passages. We also provided a second version of the corpus, in which the passages were expanded with queries generated by the docT5query model [95].

15 https://lemurproject.org/clueweb12/index.php

Table 3
Example topic for Task 2: Argument Retrieval for Comparative Questions.

Number: 88
Title: Should I major in philosophy or psychology?
Objects: major in philosophy, psychology
Description: A soon-to-be high-school graduate finds themself at a crossroad in their life. Based on their interests, majoring in philosophy or in psychology are the potential options and the graduate is searching for information about the differences and similarities, as well as advantages and disadvantages of majoring in either of them (e.g., with respect to career opportunities or gained skills).
Narrative: Relevant documents will overview one of the two majors in terms of career prospects or developed new skills, or they will provide a list of reasons to major in one or the other. Highly relevant documents will compare the two majors side-by-side and help to decide which should be preferred in what context. Not relevant are study program and university advertisements or general descriptions of the disciplines that do not mention benefits, advantages, or pros/cons.

To lower the bar to entry of this task, we also provided the participants with a number of previously compiled resources.
These included the document-level relevance and argument quality judgments from the previous Touché editions as well as the passage-level relevance judgments from a subset of MS MARCO [96] with about 40,000 comparative questions identified by an ALBERT-based [97] classifier [38]. Each question in MS MARCO is associated with 10 text passages (one is labeled as most relevant). To train stance detectors, an annotated dataset of 950 comparative questions and answers, extracted from Stack Exchange, was also provided [38]. For the identification of claims and premises, the participants could use their own or any existing argument tagging tool, such as the API17 of TARGER [98] hosted on our own servers.

5.2. Evaluation Setup

Similar to Task 1, we pooled the top-5 passages from the runs, resulting in 2,107 unique passages that were manually judged. Our volunteer human assessors labeled the passages’ relevance with three labels: 0 (not relevant), 1 (relevant), and 2 (highly relevant). They also assessed whether arguments are present in a passage and whether they are rhetorically well-written [27] with three labels: 0 (low quality, or no arguments in a passage), 1 (average quality), and 2 (high quality). Finally, we asked the assessors to label passages with respect to a topic’s comparison objects as (a) pro first object, (b) pro second object, (c) neutral (both comparison objects are equally good or bad), and (d) no stance (no stance given). In Task 2, we used nDCG@5 for the relevance and argument quality dimensions and macro-averaged F1 for the stance detection.

16 https://github.com/grill-lab/trec-cast-tools
17 Also available as a Python library: https://pypi.org/project/targer-api/

5.3. Submitted Approaches and Evaluation Results

Seven teams submitted their results to Task 2 (25 valid runs). Interestingly, only two teams used relevance judgments from the previous Touché editions to fine-tune their models or to optimize parameters.
The others either manually labeled a sample of retrieved documents themselves or relied on zero-shot approaches like the transformer-based model T0++ [6]. Most teams used the standard passage collection, but two teams also used the docT5query-expanded [95] collection provided by us. Overall, the main trend of this year was the usage of transformer-based models for ranking and re-ranking (e.g., ColBERT [99] or monoT5 and duoT5 [100]) while our baseline approach was BM25, as in the previous years.

For the optional subtask of stance detection, five of the seven teams submitted results. They either trained their own classifiers on the provided stance dataset, fine-tuned pre-trained language models, or directly used pre-trained models as zero-shot classifiers. Our baseline stance detector was a simple always-‘no stance’ predictor (majority class).

Table 4 shows the results of each team’s most effective runs with respect to relevance and argument quality (more detailed results for each submitted run can be found in Appendix A). For stance detection, for each team, we evaluated all passages that were part of the manual judgment pool and for which the team had predicted a stance. That is, the stance of a passage returned at Rank 3 by some Team X (and thus part of the judgment pool) was also used in the stance evaluation of Team Y, even when Team Y returned the passage only at Rank 6 or lower (and thus not actually part of the pool for that run). Note that this potentially yields different numbers of passages used for the stance evaluation per team. Below, we briefly describe the teams’ submitted approaches and their results (teams ordered by their relevance-wise best approach).

Captain Levi [101] submitted the relevance-wise most effective run.
They first retrieved 2,000 documents using Pyserini’s BM25 [65] (𝑘1 = 1.2 and 𝑏 = 0.68) by combining the top-1000 results for the original query (topic title) with the results for modified queries, where they used alternative strategies: (1) only removing stop words (using the NLTK [102] stop word list), (2) replacing comparative adjectives with synonyms and antonyms found in WordNet [70], (3) adding extra terms using pseudo-relevance feedback, (4) using queries generated with the docT5query model [95] provided by the Touché organizers. Queries and corpus were also processed with stop word and punctuation removal and lemmatization (WordNet lemmatizer). The initially retrieved results were re-ranked using monoT5 and duoT5 [100]. Additionally, TCT-ColBERT [9] (a variant of ColBERT [99] with knowledge distillation) was used for initial ranking for unmodified queries (topic titles). Captain Levi submitted five runs in total that differ in the aforementioned strategies of modifying queries, initial ranking models, and final re-ranking models. Their most effective run in terms of relevance and quality used initial ranking by TCT-ColBERT. Finally, stance was detected using a RoBERTa-Large-MNLI model [103], pre-trained on the Multi-Genre Natural Language Inference corpus [104] without further fine-tuning, in two steps: (1) detecting if the document has a stance, and then (2) for documents that were not classified as ‘neutral’ or ‘no stance’, detecting which comparison object the document favors. This stance detector achieved the highest macro-averaged F1 score.

Table 4
Results of Task 2 (Argument Retrieval for Comparative Questions). (a) Evaluation results of a team’s best run according to the results’ relevance. (b) Best runs according to the results’ quality. (c) Stance detection results (the teams’ ordering is the same as in (b)). An asterisk (⋆) indicates that the runs with the best relevance and the best quality differ for a team. The baseline BM25 ranking is Puss in Boots; the baseline stance detector always predicts ‘no stance’. A † indicates statistically significant differences to the baseline (paired Student’s 𝑡-test, 𝑝 = 0.05, Bonferroni correction). Since stance detection results were calculated for different numbers of predictions for each team, we do not test statistical differences. Tables 9–11 show the results for all submitted runs.

(a) Best relevance score per team      (b) Best quality score per team      (c) Stance
Team                nDCG@5             Team                nDCG@5           F1 macro
                    Rel.     Qual.                         Qual.    Rel.    Rank  Score
Captain Levi        0.758†   0.744     Aldo Nadi⋆          0.774†   0.695   -     -
Aldo Nadi⋆          0.709†   0.748     Captain Levi        0.744†   0.758   1     0.261
Katana⋆             0.618†   0.643     Katana⋆             0.644†   0.601   3     0.220
Captain Tempesta⋆   0.574†   0.589     Captain Tempesta⋆   0.597†   0.557   -     -
Olivier Armstrong   0.492    0.582     Olivier Armstrong   0.582    0.492   4     0.191
Puss in Boots       0.469    0.476     Puss in Boots       0.476    0.469   5     0.158
Grimjack            0.422    0.403     Grimjack            0.403    0.422   2     0.235
Asuna               0.263†   0.332     Asuna               0.332†   0.263   6     0.106

Aldo Nadi [105] submitted the quality-wise most effective run. They re-ranked passages that were initially retrieved with BM25F [10] (default Lucene implementation with 𝑘1 = 1.2 and 𝑏 = 0.75) on two fields: text of the original passages, and passages expanded with docT5query. All texts were processed with the Porter stemmer [67], removing stop words using different lists: (a) Snowball [106], (b) a default Lucene stop word list, (c) a custom list containing the 400 most frequent terms in the retrieval collection, excluding the comparison objects. Queries (topic titles) were expanded using a relevance feedback method based on the Rocchio algorithm [107]. For the final ranking, the team experimented with two re-ranking techniques (involving up to the top-1000 documents from the initial results): (1) exploiting the argument quality estimation, i.e., they multiplied the document relevance and the quality scores, and (2) Reciprocal Rank Fusion [108].
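As an illustration of the second re-ranking technique, the following minimal sketch fuses several rankings with reciprocal rank fusion. It is not the team's implementation: the function name, the tie-breaking rule, and the smoothing constant k = 60 (a commonly used default) are assumptions made here for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids.

    Each document receives the sum of 1 / (k + rank) over all rankings in
    which it appears (rank starts at 1). k = 60 is an assumed default, not
    a value reported in the text.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Higher fused score first; ties are broken by document id (assumption).
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical example: one ranking by relevance, one by argument quality.
by_relevance = ["a", "b", "c"]
by_quality = ["b", "c", "a"]
fused = reciprocal_rank_fusion([by_relevance, by_quality])
```

Because only ranks enter the fused score, the two input rankings do not need comparable score scales, which is one reason rank-based fusion is often preferred over multiplying raw scores.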
The quality scores were predicted using the IBM Project Debater API [55]. Aldo Nadi submitted five runs, which vary by different combinations of the proposed methods, e.g., using different stop word lists for pre-processing, using relevance feedback or not, using the quality-based re-ranking or fusion. The team’s most effective run in terms of relevance used relevance feedback, and the most effective run in terms of quality was based on Reciprocal Rank Fusion. The team did not detect stance.

Katana [109] submitted three runs that all used different variants of ColBERT [99]: (1) pre-trained on MS MARCO [96] by the University of Glasgow,18 (2) pre-trained by Katana from scratch on MS MARCO, replacing the cosine similarity between a query and a document representation with the L2 distance, and (3) the latter model fine-tuned on the relevance and quality judgments from the previous Touché editions. As queries, the team used topic titles without additional processing. The team’s most effective run in terms of relevance used ranking by pre-trained ColBERT, and the most effective run in terms of quality used ranking by ColBERT trained from scratch (without further fine-tuning). For stance detection, Katana used a pre-trained XGBoost-based classifier that is part of the Comparative Argumentation Machine [35, 33].

Captain Tempesta [110] exploited linguistic properties of text such as the frequency of non-informative symbols (hashtags, emojis, etc.), the difference between the frequency of short words (at most 4 characters) and the frequency of long words (more than 4 characters), and the frequencies of adjectives and comparative adjectives. Based on these properties, a quality score was computed for each document in the retrieval corpus as a weighted sum (weights were assigned manually). At query time, the relevance score of BM25 (Lucene; default: 𝑘1 = 1.2 and 𝑏 = 0.75) was multiplied with the quality score, and the product was used as the ranking criterion.
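The quality-weighted ranking described for Captain Tempesta can be sketched as follows. This is a simplified illustration under stated assumptions, not the team's code: the feature name, the normalization of the short/long word feature by document length, the whitespace tokenization, and all weights are hypothetical.

```python
def short_long_difference(text):
    """Difference between the relative frequencies of short words
    (at most 4 characters) and long words (more than 4 characters).
    Whitespace tokenization and length normalization are assumptions."""
    tokens = text.split()
    if not tokens:
        return 0.0
    short = sum(1 for token in tokens if len(token) <= 4)
    long_ = len(tokens) - short
    return (short - long_) / len(tokens)


def quality_score(features, weights):
    """Weighted sum of linguistic features (weights are set manually)."""
    return sum(weights[name] * value for name, value in features.items())


def rank(candidates, weights):
    """candidates maps doc id -> (bm25_score, feature dict).
    The final score is the BM25 score multiplied by the quality score."""
    final = {
        doc_id: bm25 * quality_score(features, weights)
        for doc_id, (bm25, features) in candidates.items()
    }
    return sorted(final, key=lambda doc_id: (-final[doc_id], doc_id))


# Hypothetical weights and candidates, for illustration only.
weights = {"short_long_difference": 1.0}
candidates = {
    "p1": (12.0, {"short_long_difference": 0.25}),
    "p2": (10.0, {"short_long_difference": 0.5}),
}
ranking = rank(candidates, weights)
```

In this toy example the passage with the lower BM25 score is ranked first because its quality score is higher, which shows how the multiplicative combination can reorder a purely relevance-based ranking.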
Queries (topic titles) were processed by removing stop words (Lucene default list), lowercasing query terms except for brand names,19 and stemming them using the Lovins stemmer [111]. The team’s five submitted runs differ in the weights manually assigned for the different quality properties. The team’s most effective run in terms of relevance used document quality estimation with linguistic properties, and the most effective run in terms of quality did not. The team did not detect stance.

Olivier Armstrong [112] submitted one run. They first identified the comparison objects, aspects, and predicates in queries (topic titles) using a RoBERTa-based classifier proposed by Bondarenko et al. [38]. After removing stop words, queries were expanded with synonyms of the objects, aspects, and predicates found using WordNet. Then 100 documents were retrieved using Elasticsearch’s BM25 (𝑘1 = 1.2 and 𝑏 = 0.75) as initial ranking. Using a DistilBERT-based classifier [113], fine-tuned by Alhamzeh et al. [114] (a Touché 2021 participant), Olivier Armstrong identified premises and claims in the retrieved documents. For ranking, the following scores were calculated for each candidate document: (1) the arg-BM25 score returned by querying the new re-indexed corpus (only premises and claims are kept) using the unmodified queries (topic titles), (2) the argument support score, i.e., the ratio of premises and claims in the document, (3) the similarity score, i.e., the averaged cosine similarity between the original query and every premise and claim in the document represented using the SBERT embeddings [60]. The final score for each candidate document was calculated as the sum of the normalized individual scores. Their final ranking included 25 documents. For stance detection, the team used an LSTM-based neural network with one hidden layer that was pre-trained on the provided stance dataset.
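The score combination described above for Olivier Armstrong (a sum of normalized individual scores) can be illustrated with the following sketch. The text does not state which normalization was used; min-max normalization per score type is an assumption here, as are the function names and the example values.

```python
def min_max(values):
    """Scale a list of values to [0, 1]; constant lists map to 0.0.
    Min-max scaling is an assumed choice of normalization."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


def combine_scores(candidates):
    """candidates maps doc id -> (arg_bm25, support, similarity).
    Returns doc ids ordered by the sum of the normalized scores."""
    doc_ids = list(candidates)
    if not doc_ids:
        return []
    # Normalize each score type separately across all candidates.
    columns = zip(*(candidates[doc_id] for doc_id in doc_ids))
    normalized = [min_max(list(column)) for column in columns]
    totals = {
        doc_id: sum(column[i] for column in normalized)
        for i, doc_id in enumerate(doc_ids)
    }
    return sorted(doc_ids, key=lambda doc_id: (-totals[doc_id], doc_id))


# Hypothetical scores: (arg-BM25, argument support, SBERT similarity).
candidates = {
    "d1": (10.0, 0.5, 0.8),
    "d2": (5.0, 1.0, 0.2),
    "d3": (0.0, 0.0, 0.5),
}
order = combine_scores(candidates)
```

Normalizing before summing keeps the unbounded BM25 score from dominating the two ratio-like scores.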
Puss in Boots was our baseline retrieval model that used the BM25 implementation in Pyserini [65] with default parameters (𝑘1 = 0.9 and 𝑏 = 0.4) and original topic titles as queries. The baseline stance detector simply assigned ‘no stance’ to all documents in the ranked list.

18 http://www.dcs.gla.ac.uk/∼craigm/colbert.dnn.zip
19 https://github.com/MatthiasWinkelmann/english-words-names-brands-places

Grimjack [115] submitted five runs using query expansion and query reformulation, argument quality estimation, stance detection, and axiomatic re-ranking. For the first ranking result, the team simply retrieved 100 passages ranked with a Pyserini implementation of DirichletLM (default 𝜇 = 1000), using original, unmodified queries (topic titles). Another approach re-ranked the top-10 of the initially retrieved passages using (1) argument axioms that “prefer” documents with more premises and claims (identified with TARGER [98]) or earlier occurrence of query terms in premises and claims [116, 117], (2) newly proposed comparative axioms that “prefer” documents with more comparison objects or their earlier occurrence in premises and claims, and (3) an argument quality axiom that ranks documents higher if they have higher argument quality scores, calculated using the IBM Project Debater API [55]. For another result ranking, document positions (from the previous run) were changed based on the predicted stance, such that a ‘pro first object’ document was followed by a ‘pro second object’ document, followed by a ‘neutral’ document. The document stance was predicted using the IBM Project Debater API [55]. The last two runs used T0++ [6] (1) to expand queries, e.g., by combining topic titles with newly generated queries, where T0++ was prompted to generate a question given a topic’s description, (2) to assess the argument quality, and (3) to detect the stance in zero-shot settings. These two runs differed in whether stance balancing was used.
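One possible reading of the stance-based reordering described for Grimjack is a round-robin interleaving of the stance groups. The sketch below follows that reading; the stance label strings, the handling of ‘no stance’ documents (appended at the end), and the function name are assumptions and may differ from the team's actual procedure.

```python
from collections import deque

# Assumed label names; the order follows the description in the text.
STANCE_ORDER = ("first", "second", "neutral")


def balance_by_stance(ranking):
    """ranking: list of (doc_id, stance) pairs in original rank order.

    Documents are interleaved so that a 'pro first object' document is
    followed by a 'pro second object' document and then a 'neutral' one,
    keeping the original order within each stance group. Documents with
    any other label are appended at the end (assumption).
    """
    queues = {stance: deque() for stance in STANCE_ORDER}
    rest = []
    for doc_id, stance in ranking:
        if stance in queues:
            queues[stance].append(doc_id)
        else:
            rest.append(doc_id)
    balanced = []
    while any(queues.values()):
        for stance in STANCE_ORDER:
            if queues[stance]:
                balanced.append(queues[stance].popleft())
    return balanced + rest


# Hypothetical ranked list with predicted stances.
ranked = [
    ("a", "first"), ("b", "first"), ("c", "second"),
    ("d", "no stance"), ("e", "neutral"), ("f", "second"),
]
balanced = balance_by_stance(ranked)
```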
The team’s most effective run in terms of relevance and quality used axiomatic re-ranking and re-ranking based on the detected stance.

Asuna [118] preprocessed each document (passage) in the retrieval corpus by (1) creating a one-sentence extractive summary using LexRank [119], (2) identifying premises and claims with TARGER [98], and (3) looking up the spam score in the Waterloo Spam Rankings dataset [120].20 The modified corpus was indexed, and initial retrieval of the top-40 documents was performed with the Pyserini [65] implementation of BM25F (default 𝑘1 = 0.9 and 𝑏 = 0.4) using the unmodified queries (topic titles) over the index fields with original passages, summaries, and premises and claims. Next, the queries were lemmatized and stop words were removed using the NLTK library, and expanded with the most frequent terms coming from LDA topics [121] for the initially retrieved documents. The expanded queries were used to, again, retrieve the top-40 passages with BM25F. Finally, Asuna re-ranked the retrieved documents using a random forest classifier [122] with the following features: BM25F score, number of times the document was retrieved for different queries (original, three extended with the LDA topics for documents, and one extended with the LDA topic for the task topic description), number of tokens in documents, number of sentences in documents, number of premises in documents, number of claims in documents, spam-score, predicted argument quality score, and predicted stance. The classifier was trained on the Touché 2020 and 2021 relevance judgments. The argument quality was predicted using DistilBERT, fine-tuned on the Webis-ArgQuality-20 corpus [89]. The stance was also predicted using DistilBERT, fine-tuned on the provided stance dataset.

6. Task 3: Image Retrieval for Arguments

The goal of the Touché 2022 lab’s third task was to provide argumentation support through image search.
The retrieval of relevant images should provide both a quick visual overview of frequent arguments on some topic and compelling images to support one’s argumentation. The goal of the third task was thus to retrieve images that indicate agreement or disagreement with some stance on a given topic, as two separate lists, similar to textual argument search.

20 https://lemurproject.org/clueweb12/related-data.php

6.1. Task Definition and Data

Task. Given a controversial topic, the task was to retrieve images (from web pages) for each stance (pro and con) that show support for that stance.

Topics. Task 3 uses the same 50 controversial topics as Task 1 (cf. Section 4).

Document collection. This task’s document collection stems from a focused crawl of 23,841 images and associated web pages from late 2021. For each of the 50 topics, we issued 11 queries (with different filter words like “good,” “meme,” “stats,” “reasons,” or “effects”) to Google’s image search and downloaded the top 100 images and associated web pages; 868 duplicate images were identified and removed using pHash21 and manual checks. The dataset contains for each image: (1) the image itself in both WebP and PNG format; (2) its URL; (3) its pHash. Moreover, the dataset contains for each page: (1) its URL; (2) the Google rank of the page for each query for which the image was retrieved; (3) a WARC web archive;22 (4) a DOM HTML snapshot; (5) its complete text; (6) a screenshot; (7) meta-information of each DOM node, including the node’s xPath, CSS attributes, and position on the screenshot; and (8) the xPath of the corresponding image in the DOM HTML snapshot. The full dataset has a size of 368 GB.23 To kickstart machine learning approaches, we provided 334 relevance judgments from Kiesel et al. [51].

6.2. Evaluation Setup

We employed crowdsourcing on Amazon Mechanical Turk24 to evaluate the topical relevance, argumentativeness, and stance of the 6,607 image-topic pairs from all runs, employing 5 independent annotators each. Specifically, we asked for each topic for which an image was retrieved: (1) Is the image in some manner related to the topic? (2) Do you think most people would say that, if someone shares this image without further comment, they want to show they approve of the pro-side to the topic? (3) Or do you think most people would rather say the one who shares this image does so to show they disapprove? We described each topic using the topic’s title, modified as necessary to convey the description and narrative (cf. Table 1) and to clarify which stance is approve (pro) and disapprove (con). We then iteratively employed MACE [123] to identify image–topic pairs with low annotator agreement (MACE confidence ≤ 0.55) and re-judged them ourselves, employing our judgments as check instances for another iteration of MACE. We repeated this procedure until MACE predicted the labels for all image–topic pairs from the runs with a confidence above 0.55 (re-judging 2,056 images in total).

6.3. Submitted Approaches and Evaluation Results

In total, 3 teams submitted 12 runs to this task. The teams pursued quite different approaches. However, all participants employed OCR (specifically Tesseract25) to extract image text. The teams Boromir and Jester also used the associated web page’s text, but Team Jester restricted it to text close to the image on the web page. Each team used sentiment or emotion features, based on image colors (Aramis), faces in the images (Jester), image text (all), and the web page text (Boromir, Jester). Boromir used the ranking information for internal evaluation.

21 https://www.phash.org/
22 Archived using https://github.com/webis-de/scriptor
23 Available at https://webis.de/data.html#touche22-image-retrieval-for-arguments
24 https://www.mturk.com
25 https://github.com/tesseract-ocr/tesseract

Table 5
Results of Task 3 (Image Retrieval for Arguments) in terms of Precision@10 (per stance) for topic relevance, argumentativeness, and stance relevance. The table shows the best run for each team across all three measures. The baseline is Minsc. A † indicates statistically significant differences to the baseline (paired Student’s 𝑡-test, 𝑝 = 0.05, Bonferroni correction). Table 12 shows the results for all submitted runs.

Team      Run                                          Precision@10
                                                       Topic    Arg.     Stance
Boromir   BERT, OCR, query-processing                  0.878†   0.768†   0.425
Minsc     Baseline                                     0.736    0.686    0.407
Aramis    Argumentativeness:formula, stance:formula    0.701†   0.634    0.381
Jester    With emotion detection                       0.696†   0.647†   0.350†

We used Precision@10 for evaluation: the ratio of relevant images among 10 retrieved images for each topic and stance. Table 5 shows the results of each team’s most effective run. For each team, the best runs were the same with respect to all three measures. Minsc represents our baseline run, which ranks images in the same order as our original Google queries, namely of the query that includes the filter word “good” for pro and of the query that includes “anti” for con. We considered this a tough baseline, especially for on-topic relevance, as topical relatedness is similar for argumentative and “standard” web image search. However, Boromir beat this baseline, with a considerable margin for on-topic relevance.

Aramis [124] focused on image features. No retrieval model was employed, but all images were evaluated for each topic. They tested the use of a heuristic formula vs. fully-connected neural network classifiers for both argumentativeness and stance detection.
Features were based on OCR (text length in characters, text area size, cells in an 8×8 grid with high text density, and VADER sentiment score [78]), image color (average color, dominant color, and percentages of pixels in each of these color ranges as per self-defined RGB buckets: red, green, blue, yellow, light, and dark), image category (graphic vs. photo [125]; percentage of area covered by diagrams26), and query–text similarity (whether the query is fully contained, the overlap for an optimal query alignment, and VADER sentiment score of words in a six-token radius around occurrences of query terms in the text). However, the query–text similarity features were not used for argumentativeness classification, as the team assumed this sub-task to be query-independent. In our evaluation, the formula performed better than the neural approaches, which Aramis traced back to the formula being slightly better at handling off-topic images: since topical relevance was not the team’s focus, they had trained and internally evaluated the network on on-topic images only. However, their worst runs still achieved a Precision@10 similar to that of their best one, namely 0.664 (topic; -0.037 compared to best run), 0.609 (argumentativeness; -0.025), and 0.344 (stance; -0.037). Moreover, for an evaluation that ignores the problem of topical relevance, the ratio of argumentative images among topically relevant images for their runs is between 0.904 (using both formulas) and 0.927 (using both networks), and thus very close to the baseline, which reaches a ratio of 0.932.

26 Based on a Stackoverflow answer, archived as https://perma.cc/KE6J-KMQT
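The following sketch illustrates Precision@10 and the ratio of argumentative images among topically relevant images discussed above. The function name and the example judgments are hypothetical. The two ratios are recomputed from the Precision@10 values in Table 5, under the assumption that every argumentative image also counts as topically relevant; the rounded results appear to be consistent with the reported 0.904 and 0.932.

```python
def precision_at_k(relevance_labels, k=10):
    """relevance_labels: 0/1 judgments of the retrieved images in rank
    order. Returns the share of relevant images among the top k; lists
    shorter than k are still divided by k (assumption)."""
    return sum(relevance_labels[:k]) / k


# Hypothetical judgments for one topic and stance.
example = precision_at_k([1, 1, 0, 1, 0, 0, 1, 1, 1, 0])

# Ratio of argumentative to topically relevant images, derived from the
# Precision@10 values in Table 5 (argumentativeness / topic relevance).
aramis_ratio = round(0.634 / 0.701, 3)    # Aramis, both formulas
baseline_ratio = round(0.686 / 0.736, 3)  # Minsc baseline
```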
Boromir [126] indexed both image text (boosted five-fold) and web page text (Elasticsearch BM25 with default settings, 𝑘1 = 1.2 and 𝑏 = 0.75), using lowercasing, URL, punctuation and number removal, NLTK’s WordNet lemmatization [102], removal of tokens consisting of exactly one letter, stop word removal (using the list from NLTK), and min-frequency filtering (removing tokens that appear fewer than three times in the text). They clustered images into 13 clusters (as determined by the elbow criterion) using 𝑘-Means and manually assigned retrieval boosts per cluster to favor more argumentative images, especially diagrams. For example, clusters with the highest boost of 5.0 were found to contain, upon manual inspection, “graphics with text (e.g., memes, quotes, twitter posts),” “graphics with round forms and text (e.g., pie charts),” “statistical graphics but with better quality [...] (e.g., bar plots, tables, line plots),” and “statistical plots (bar plots and line plots).” In contrast, images from the five clusters that were found to contain mostly photos were not boosted. They employed textual sentiment detection for stance detection, using either a dictionary (AFINN [127]) or a BERT classifier. Their approach performed best and convincingly improved over the baseline. The BERT classifier improved over the dictionary-based classifier whereas image clustering was detrimental. Specifically, the image clustering seemed to introduce more off-topic images into the ranking: the same setup as the best run but using image clusters achieved a Precision@10 of 0.822 (topic; -0.056), 0.728 (argumentativeness; -0.040), and 0.411 (stance; -0.014).

Jester focused on emotion-based image retrieval via facial image recognition (using FER27), image text, and the associated web page’s text that is close to the image in the HTML source code, for which they use the text within the image’s parent element.
Similar to stance detection in the args.me search engine [14], they assign positive-leaning images to the pro-stance and negative-leaning images to the con-stance. For comparison, they submitted a second run without emotion features (thus plain retrieval), which achieved a lower Precision@10: 0.671 (topic; -0.025), 0.618 (argumentativeness; -0.029), and 0.336 (stance; -0.014). Thus emotion features seem helpful but insufficient when taken alone.

27 https://github.com/justinshenk/fer

7. Conclusion

The third edition of the Touché lab at CLEF 2022 featured three shared tasks: (1) argument retrieval for controversial questions, (2) argument retrieval for comparative questions, and (3) image retrieval for arguments. Compared to previous editions, retrieval units have been changed (sentences/passages instead of full arguments/documents and images as a completely new unit) and stance detection has been included. Of 58 registered teams, 23 participated in the tasks and submitted at least one valid run. In addition to sparse retrieval and various query processing, reformulation, and expansion methods, approaches have increasingly focused on transformer models and re-ranking techniques. Not only was the quality of the documents and arguments evaluated, but the predicted stance was also taken into account for the final rankings.

The most effective approaches to argument retrieval all share common characteristics. For example, most use various strategies for query reformulation and expansion, such as synonyms, relevance feedback, or generating new queries with pre-trained language models. An interesting observation is that re-ranking first-stage search results based on a quality assessment of the arguments almost always improves retrieval effectiveness. Specifically for Task 2 (comparative questions), re-ranking based on important terms such as comparison objects and aspects or argument units in documents (premises and assertions) was successful.
In Task 2, stance detection was a new subtask, and some participants included a re-ranking step based on the predicted stance in their retrieval pipelines, which had some promising effects on retrieval effectiveness. However, the effectiveness of the stance detection approaches is overall still rather low, which leaves room for future improvements. For Task 3 (image retrieval), the recognition of sentiment and emotion and the use of OCR to analyze the text in images were particularly helpful.

We plan to continue Touché as a collaborative platform for researchers in argument retrieval. All Touché resources are freely available, including topics, manual relevance and argument quality assessments, and submitted runs from participating teams. These resources, the submission and evaluation tools, and other events such as workshops will help to further foster the community working on argument retrieval. In the future, we plan to expand the evaluation pools and to include additional dimensions of argument quality. Improving stance detection and better exploiting predicted stances, not only for ranking textual arguments but also for images, are further interesting tasks for future work.

Acknowledgments

We are very grateful to the CLEF 2022 organizers and the Touché participants, who allowed this lab to happen. We also want to thank our volunteer annotators who helped to create the relevance and argument quality assessments and our reviewers for their valuable feedback on the participants’ notebooks.
This work was partially supported by the Deutsche Forschungsgemeinschaft (DFG) through the projects “ACQuA 2.0” (Answering Comparative Questions with Arguments; project number 376430233) and “OASiS” (Objective Argument Summarization in Search; project number 455913891) as part of the priority program “RATIO: Robust Argumentation Machines” (SPP 1999), and the German Ministry for Science and Education (BMBF) through the project “SharKI” (Shared Tasks as an Innovative Approach to Implement AI and Big Data-based Applications within Universities; grant FKZ 16DHB4021). We are also grateful to Jan Heinrich Reimer for developing the TARGER Python library and Erik Reuter for expanding a document collection for Task 2 with docT5query.

References

[1] A. Bondarenko, M. Fröbe, J. Kiesel, S. Syed, T. Gurcke, M. Beloucif, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2022: Argument retrieval, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association (CLEF 2022), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2022.
[2] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument retrieval, in: Working Notes Papers of the CLEF 2020 Evaluation Labs, volume 2696 of CEUR Workshop Proceedings, 2020. URL: http://ceur-ws.org/Vol-2696/.
[3] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument retrieval, in: Working Notes Papers of the CLEF 2021 Evaluation Labs, volume 2936 of CEUR Workshop Proceedings, 2021. URL: http://ceur-ws.org/Vol-2936/.
[4] S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M.
Gatford, Okapi at TREC-3, in: Proceedings of The Third Text REtrieval Conference, TREC 1994, volume 500-225 of NIST Special Publication, NIST, 1994, pp. 109–126. URL: https://trec.nist.gov/pubs/trec3/papers/city.ps.gz.
[5] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, J. Mach. Learn. Res. 21 (2020) 140:1–140:67. URL: http://jmlr.org/papers/v21/20-074.html.
[6] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. V. Nayak, D. Datta, J. Chang, M. T. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Févry, J. A. Fries, R. Teehan, S. Biderman, L. Gao, T. Bers, T. Wolf, A. M. Rush, Multitask prompted training enables zero-shot task generalization, CoRR abs/2110.08207 (2021). URL: https://arxiv.org/abs/2110.08207. arXiv:2110.08207.
[7] J. Lin, R. Nogueira, A. Yates, Pretrained Transformers for Text Ranking: BERT and Beyond, Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers, 2021. URL: https://doi.org/10.2200/S01123ED1V01Y202108HLT053. doi:10.2200/S01123ED1V01Y202108HLT053.
[8] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, ACL, 2019, pp. 4171–4186. URL: https://doi.org/10.18653/v1/n19-1423.
[9] S. Lin, J. Yang, J. Lin, Distilling dense representations for ranking using tightly-coupled teachers, CoRR abs/2010.11386 (2020). URL: https://arxiv.org/abs/2010.11386. arXiv:2010.11386.
[10] S. E. Robertson, H. Zaragoza, M. J.
Taylor, Simple BM25 extension to multiple weighted fields, in: Proceedings of the 13th International Conference on Information and Knowledge Management, CIKM 2004, ACM, 2004, pp. 42–49. URL: https://doi.org/10.1145/1031171.1031181.
[11] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, I. Gurevych, BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models, in: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL: https://openreview.net/forum?id=wCu6T5xFjeJ.
[12] S. MacAvaney, A. Yates, S. Feldman, D. Downey, A. Cohan, N. Goharian, Simplified data wrangling with ir_datasets, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021, ACM, 2021, pp. 2429–2436. URL: https://doi.org/10.1145/3404835.3463254.
[13] H. Wachsmuth, S. Syed, B. Stein, Retrieval of the best counterargument without prior topic knowledge, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Association for Computational Linguistics, 2018, pp. 241–251. URL: https://www.aclweb.org/anthology/P18-1023/.
[14] H. Wachsmuth, M. Potthast, K. Al-Khatib, Y. Ajjour, J. Puschmann, J. Qu, J. Dorsch, V. Morari, J. Bevendorff, B. Stein, Building an argument search engine for the web, in: Proceedings of the Fourth Workshop on Argument Mining (ArgMining), Association for Computational Linguistics, 2017, pp. 49–59. URL: https://doi.org/10.18653/v1/w17-5106.
[15] Y. Ajjour, P. Braslavski, A. Bondarenko, B. Stein, Identifying argumentative questions in web search logs, in: 45th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2022), ACM, 2022. doi:10.1145/3477495.3531864.
[16] Y. Ajjour, H. Wachsmuth, J. Kiesel, M. Potthast, M. Hagen, B.
Stein, Data acquisition for argument search: The args.me corpus, in: Proceedings of the 42nd German Conference on AI, KI 2019, volume 11793 of Lecture Notes in Computer Science, Springer, 2019, pp. 48–59. URL: https://doi.org/10.1007/978-3-030-30179-8_4.
[17] R. Levy, B. Bogin, S. Gretz, R. Aharonov, N. Slonim, Towards an argumentative content search engine using weak supervision, in: Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Association for Computational Linguistics, 2018, pp. 2066–2081. URL: https://www.aclweb.org/anthology/C18-1176/.
[18] C. Stab, J. Daxenberger, C. Stahlhut, T. Miller, B. Schiller, C. Tauchmann, S. Eger, I. Gurevych, ArgumenText: Searching for arguments in heterogeneous sources, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL 2018, Association for Computational Linguistics, 2018, pp. 21–25. URL: https://www.aclweb.org/anthology/N18-5005.
[19] Aristotle, G. A. Kennedy, On Rhetoric: A Theory of Civic Discourse, Oxford: Oxford University Press, 2006.
[20] H. Wachsmuth, N. Naderi, I. Habernal, Y. Hou, G. Hirst, I. Gurevych, B. Stein, Argumentation quality assessment: Theory vs. practice, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Association for Computational Linguistics, 2017, pp. 250–255. URL: https://doi.org/10.18653/v1/P17-2039.
[21] M. Potthast, L. Gienapp, F. Euchner, N. Heilenkötter, N. Weidmann, H. Wachsmuth, B. Stein, M. Hagen, Argument search: Assessing argument relevance, in: Proceedings of the 42nd International Conference on Research and Development in Information Retrieval, SIGIR 2019, ACM, 2019, pp. 1117–1120. URL: https://doi.org/10.1145/3331184.3331327.
[22] L. Gienapp, B. Stein, M. Hagen, M.
Potthast, Efficient pairwise annotation of argument quality, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 5772–5781. URL: https://aclanthology.org/2020.acl-main.511. doi:10.18653/v1/2020.acl-main.511.
[23] H. Wachsmuth, B. Stein, Y. Ajjour, "PageRank" for argument relevance, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Association for Computational Linguistics, 2017, pp. 1117–1127. URL: https://doi.org/10.18653/v1/e17-1105.
[24] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Technical Report 1999-66, Stanford InfoLab, 1999. URL: http://ilpubs.stanford.edu:8090/422/.
[25] L. Dumani, P. J. Neumann, R. Schenkel, A framework for argument retrieval - ranking argument clusters by frequency and specificity, in: Proceedings of the 42nd European Conference on IR Research (ECIR 2020), volume 12035 of Lecture Notes in Computer Science, Springer, 2020, pp. 431–445. URL: https://doi.org/10.1007/978-3-030-45439-5_29.
[26] L. Dumani, R. Schenkel, Quality aware ranking of arguments, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM ’20, Association for Computing Machinery, 2020, pp. 335–344. URL: https://doi.org/10.1007/978-3-030-45439-5_29.
[27] H. Wachsmuth, N. Naderi, Y. Hou, Y. Bilu, V. Prabhakaran, T. A. Thijm, G. Hirst, B. Stein, Computational argumentation quality assessment in natural language, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, 2017, pp. 176–187. URL: http://aclweb.org/anthology/E17-1017.
[28] A. Nadamoto, K. Tanaka, A comparative web browser (CWB) for browsing and comparing web pages, in: Proceedings of the 12th International World Wide Web Conference, WWW 2003, ACM, 2003, pp. 727–735.
URL: https://doi.org/10.1145/775152.775254.
[29] J. Sun, X. Wang, D. Shen, H. Zeng, Z. Chen, CWS: A comparative web search system, in: Proceedings of the 15th International Conference on World Wide Web, WWW 2006, ACM, 2006, pp. 467–476. URL: https://doi.org/10.1145/1135777.1135846.
[30] N. Jindal, B. Liu, Identifying comparative sentences in text documents, in: Proceedings of the 29th Annual International Conference on Research and Development in Information Retrieval, SIGIR 2006, ACM, 2006, pp. 244–251. URL: https://doi.org/10.1145/1148170.1148215.
[31] N. Jindal, B. Liu, Mining comparative sentences and relations, in: Proceedings of the 21st National Conference on Artificial Intelligence and the 18th Innovative Applications of Artificial Intelligence Conference, AAAI 2006, AAAI Press, 2006, pp. 1331–1336. URL: http://www.aaai.org/Library/AAAI/2006/aaai06-209.php.
[32] W. Kessler, J. Kuhn, A corpus of comparisons in product reviews, in: Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014, European Language Resources Association (ELRA), 2014, pp. 2242–2248. URL: http://www.lrec-conf.org/proceedings/lrec2014/summaries/1001.html.
[33] A. Panchenko, A. Bondarenko, M. Franzek, M. Hagen, C. Biemann, Categorizing comparative sentences, in: Proceedings of the 6th Workshop on Argument Mining, ArgMining@ACL 2019, Association for Computational Linguistics, 2019, pp. 136–145. URL: https://doi.org/10.18653/v1/w19-4516.
[34] N. Ma, S. Mazumder, H. Wang, B. Liu, Entity-aware dependency-based deep graph attention network for comparative preference classification, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Association for Computational Linguistics, 2020, pp. 5782–5788. URL: https://www.aclweb.org/anthology/2020.acl-main.512/.
[35] M. Schildwächter, A. Bondarenko, J. Zenker, M. Hagen, C. Biemann, A.
Panchenko, Answering comparative questions: Better than ten-blue-links?, in: Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, CHIIR 2019, ACM, 2019, pp. 361–365. URL: https://doi.org/10.1145/3295750.3298916.
[36] V. Chekalina, A. Bondarenko, C. Biemann, M. Beloucif, V. Logacheva, A. Panchenko, Which is better for deep learning: Python or matlab? answering comparative questions in natural language, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, EACL 2021, Association for Computational Linguistics, 2021, pp. 302–311. URL: https://www.aclweb.org/anthology/2021.eacl-demos.36/.
[37] A. Bondarenko, P. Braslavski, M. Völske, R. Aly, M. Fröbe, A. Panchenko, C. Biemann, B. Stein, M. Hagen, Comparative web search questions, in: Proceedings of the 13th ACM International Conference on Web Search and Data Mining, WSDM 2020, ACM, 2020, pp. 52–60. URL: https://dl.acm.org/doi/abs/10.1145/3336191.3371848.
[38] A. Bondarenko, Y. Ajjour, V. Dittmar, N. Homann, P. Braslavski, M. Hagen, Towards understanding and answering comparative questions, in: Proceedings of the 15th ACM International Conference on Web Search and Data Mining, WSDM 2022, ACM, 2022, pp. 66–74. doi:10.1145/3488560.3498534.
[39] I. J. Dove, On images as evidence and arguments, in: F. H. van Eemeren, B. Garssen (Eds.), Topical Themes in Argumentation Theory: Twenty Exploratory Studies, Argumentation Library, Springer Netherlands, Dordrecht, 2012, pp. 223–238. doi:10.1007/978-94-007-4041-9_15.
[40] F. Dunaway, Images, emotions, politics, Modern American History 1 (2018) 369–376. doi:10.1017/mah.2018.17.
[41] G. Roque, Visual argumentation: A further reappraisal, in: F. H. van Eemeren, B. Garssen (Eds.), Topical Themes in Argumentation Theory, volume 22, Springer Netherlands, Dordrecht, 2012, pp. 273–288. URL: http://link.springer.com/10.1007/978-94-007-4041-9_18.
doi:10.1007/978-94-007-4041-9_18. Series: Argumentation Library.
[42] I. Grancea, Types of visual arguments, Argumentum. Journal of the Seminar of Discursive Logic, Argumentation Theory and Rhetoric 15 (2017) 16–34.
[43] D. Dimitrov, B. Bin Ali, S. Shaar, F. Alam, F. Silvestri, H. Firooz, P. Nakov, G. Da San Martino, SemEval-2021 task 6: Detection of persuasion techniques in texts and images, in: 15th International Workshop on Semantic Evaluation (SemEval’2021), Association for Computational Linguistics, Online, 2021, pp. 70–98. URL: https://aclanthology.org/2021.semeval-1.7. doi:10.18653/v1/2021.semeval-1.7.
[44] N. Chang, K. Fu, Query-by-pictorial-example, IEEE Transactions on Software Engineering 6 (1980) 519–524. doi:10.1109/TSE.1980.230801.
[45] P. Aigrain, H. Zhang, D. Petkovic, Content-based representation and retrieval of visual media: A state-of-the-art review, Multimedia Tools and Applications 3 (1996) 179–202. doi:10.1007/BF00393937.
[46] A. Latif, A. Rasheed, U. Sajid, J. Ahmed, N. Ali, N. I. Ratyal, B. Zafar, S. H. Dar, M. Sajid, T. Khalil, Content-based image retrieval and feature extraction: A comprehensive review, Mathematical Problems in Engineering 2019 (2019) 21. doi:10.1155/2019/9658350.
[47] A. Wu, Learn more about what you see on google images, Google Blog, 2020. URL: https://support.google.com/webmasters/answer/114016.
[48] Google, Google images best practices, Google Developers, 2021. URL: https://support.google.com/webmasters/answer/114016.
[49] W. Wang, Q. He, A survey on emotional semantic image retrieval, in: International Conference on Image Processing (ICIP 2008), IEEE, 2008, pp. 117–120. doi:10.1109/ICIP.2008.4711705.
[50] M. Solli, R. Lenz, Color emotions for multi-colored images, Color Research & Application 36 (2011) 210–221. doi:10.1002/col.20604.
[51] J. Kiesel, N. Reichenbach, B. Stein, M.
Potthast, Image retrieval for arguments using stance-aware query expansion, in: Proceedings of the 8th Workshop on Argument Mining, ArgMining 2021 at EMNLP, ACL, 2021, pp. 36–45.
[52] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA integrated research architecture, in: Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF, volume 41 of The Information Retrieval Series, Springer, 2019, pp. 123–160. URL: https://doi.org/10.1007/978-3-030-22948-1_5.
[53] M. Alshomary, N. Düsterhus, H. Wachsmuth, Extractive snippet generation for arguments, in: Proceedings of the 43rd International ACM Conference on Research and Development in Information Retrieval, SIGIR 2020, ACM, 2020, pp. 1969–1972. URL: https://doi.org/10.1145/3397271.3401186.
[54] J. R. M. Palotti, H. Scells, G. Zuccon, TrecTools: an Open-source Python Library for Information Retrieval Practitioners Involved in TREC-like Campaigns, in: Proceedings of the 42nd International Conference on Research and Development in Information Retrieval, SIGIR 2019, ACM, 2019, pp. 1325–1328. URL: https://doi.org/10.1145/3331184.3331399.
[55] R. Bar-Haim, Y. Kantor, E. Venezian, Y. Katz, N. Slonim, Project debater apis: Decomposing the AI grand challenge, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2021, Online and Punta Cana, Dominican Republic, 7-11 November, 2021, Association for Computational Linguistics, 2021, pp. 267–274. URL: https://doi.org/10.18653/v1/2021.emnlp-demo.31.
[56] N. Reimers, B. Schiller, T. Beck, J. Daxenberger, C. Stab, I. Gurevych, Classification and clustering of arguments with contextualized word embeddings, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 567–578. URL: https://aclanthology.org/P19-1054. doi:10.18653/v1/P19-1054.
[57] S. Gretz, R. Friedman, E.
Cohen-Karlik, A. Toledo, D. Lahav, R. Aharonov, N. Slonim, A large-scale dataset for argument quality ranking: Construction and analysis, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, AAAI Press, 2020, pp. 7805–7813. URL: https://ojs.aaai.org/index.php/AAAI/article/view/6285.
[58] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International conference on machine learning, PMLR, 2014, pp. 1188–1196.
[59] M. Honnibal, I. Montani, spacy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing, To appear 7 (2017) 411–420.
[60] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese bert-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Association for Computational Linguistics, 2019, pp. 3980–3990. URL: https://doi.org/10.18653/v1/D19-1410.
[61] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[62] P. Sülzle, N. Wenzlitschke, Using BERT to retrieve relevant and argumentative sentence pairs, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022.
[63] S. Bahrami, G. P. Goli, A. Pasin, N. Rajkumari, M. M. Sohail, P. Tahan, N. Ferro, SEUPD@CLEF: Team INTSEG on argument retrieval for controversial questions, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022.
[64] B. Moreira, H. Cardoso, B. Martins, F.
Goularte, Team Bruce Banner at Touché 2022: Argument retrieval for controversial questions, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022.
[65] J. Lin, X. Ma, S. Lin, J. Yang, R. Pradeep, R. Nogueira, Pyserini: A python toolkit for reproducible information retrieval research with sparse and dense representations, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021, ACM, 2021, pp. 2356–2362. URL: https://doi.org/10.1145/3404835.3463238.
[66] L. Cappellotto, M. Lando, D. Lupu, M. Mariotto, R. Rosalen, N. Ferro, SEUPD@CLEF: Team 6musk on argument retrieval for controversial questions by using pairs selection and query expansion, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022.
[67] M. F. Porter, An algorithm for suffix stripping, Program 14 (1980) 130–137. URL: https://doi.org/10.1108/eb046814.
[68] R. Krovetz, Viewing morphology as an inference process, in: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’93, Association for Computing Machinery, New York, NY, USA, 1993, pp. 191–202. URL: https://doi.org/10.1145/160688.160718. doi:10.1145/160688.160718.
[69] D. D. Lewis, Y. Yang, T. G. Rose, F. Li, RCV1: A new benchmark collection for text categorization research, J. Mach. Learn. Res. 5 (2004) 361–397. URL: http://jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.
[70] G. A. Miller, WordNet: A lexical database for English, Communications of the ACM 38 (1995) 39–41.
[71] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: Proceedings of the 1st International Conference on Learning Representations, ICLR 2013, 2013. URL: http://arxiv.org/abs/1301.3781.
[72] A. Benetti, M. D. Togni, G. Foti, R. Lacini, A. Matteazzi, E. Sgarbossa, N.
Ferro, SEUPD@CLEF: Team Gamora on argument retrieval for controversial questions, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022.
[73] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in: A. Moschitti, B. Pang, W. Daelemans (Eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, ACL, 2014, pp. 1532–1543. URL: https://doi.org/10.3115/v1/d14-1162.
[74] M. Barusco, G. D. Fiume, R. Forzan, M. G. Peloso, N. Rizzetto, E. Soleymani, N. Ferro, SEUPD@CLEF: Team Lgtm on argument retrieval for controversial questions, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022.
[75] D. Harman, How effective is suffixing?, Journal of the American Society for Information Science 42 (1991) 7–15.
[76] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in: Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, 2014, pp. 55–60.
[77] S. Rose, D. Engel, N. Cramer, W. Cowley, Automatic keyword extraction from individual documents, Text mining: applications and theory 1 (2010) 10–1002.
[78] C. Hutto, E. Gilbert, Vader: A parsimonious rule-based model for sentiment analysis of social media text, in: Proceedings of the international AAAI conference on web and social media, volume 8, 2014, pp. 216–225.
[79] J. P. Kincaid, R. P. Fishburne Jr, R. L. Rogers, B. S. Chissom, Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel, Technical Report, Naval Technical Training Command Millington TN Research Branch, 1975.
[80] M. S. Ebrahimi, A. Crivellari, A. Sah, M. Hansen, S. Mehrbanou, P. Ashok, N.
Ferro, SEUPD@CLEF: SPAM on argument retrieval for controversial questions, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022.
[81] J. Wuerf, Similar but different: Simple re-ranking approaches for argument retrieval, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022.
[82] J. G. Carbonell, J. Goldstein, The use of mmr, diversity-based reranking for reordering documents and producing summaries, in: W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, J. Zobel (Eds.), SIGIR ’98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 24-28 1998, Melbourne, Australia, ACM, 1998, pp. 335–336. URL: https://doi.org/10.1145/290941.291025. doi:10.1145/290941.291025.
[83] M. J. Kusner, Y. Sun, N. I. Kolkin, K. Q. Weinberger, From word embeddings to document distances, in: F. R. Bach, D. M. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, JMLR.org, 2015, pp. 957–966. URL: http://proceedings.mlr.press/v37/kusnerb15.html.
[84] C. V. Ta, F. Reiner, I. von Detten, F. Stöhr, Finding pairs of argumentative sentences using embeddings, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022.
[85] S. Schmidt, J. Probst, B. Bartelt, A. Hinz, Two-stage retrieval for pairs of argumentative sentences, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022.
[86] C. Zhai, J. D. Lafferty, A study of smoothing methods for language models applied to ad hoc information retrieval, in: Proceedings of the 24th International Conference on Research and Development in Information Retrieval, SIGIR 2001, ACM, 2001, pp. 334–342. URL: https://doi.org/10.1145/383952.384019.
[87] G.
Amati, Frequentist and bayesian approach to information retrieval, in: M. Lalmas, A. MacFarlane, S. M. Rüger, A. Tombros, T. Tsikrika, A. Yavlinsky (Eds.), Advances in Information Retrieval, 28th European Conference on IR Research, ECIR 2006, London, UK, April 10-12, 2006, Proceedings, volume 3936 of Lecture Notes in Computer Science, Springer, 2006, pp. 13–24. URL: https://doi.org/10.1007/11735106_3. doi:10.1007/11735106_3.
[88] C. Macdonald, N. Tonellotto, Declarative experimentation in information retrieval using pyterrier, in: K. Balog, V. Setty, C. Lioma, Y. Liu, M. Zhang, K. Berberich (Eds.), ICTIR ’20: The 2020 ACM SIGIR International Conference on the Theory of Information Retrieval, Virtual Event, Norway, September 14-17, 2020, ACM, 2020, pp. 161–168. URL: https://doi.org/10.1145/3409256.3409829. doi:10.1145/3409256.3409829.
[89] L. Gienapp, B. Stein, M. Hagen, M. Potthast, Efficient pairwise annotation of argument quality, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Association for Computational Linguistics, 2020, pp. 5772–5781. URL: https://www.aclweb.org/anthology/2020.acl-main.511/.
[90] G. Amati, C. J. van Rijsbergen, Probabilistic models of information retrieval based on measuring the divergence from randomness, ACM Trans. Inf. Syst. 20 (2002) 357–389. URL: http://doi.acm.org/10.1145/582415.582416. doi:10.1145/582415.582416.
[91] M. Kaszkiel, J. Zobel, Passage retrieval revisited, in: N. J. Belkin, A. D. Narasimhalu, P. Willett, W. R. Hersh, F. Can, E. M. Voorhees (Eds.), Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1997, Philadelphia, PA, USA, July 27-31, 1997, ACM, 1997, pp. 178–185. URL: https://doi.org/10.1145/258525.258561.
[92] M. Fröbe, J. Bevendorff, L. Gienapp, M. Völske, B. Stein, M. Potthast, M.
Hagen, Copycat: Near-duplicates within and between the clueweb and the common crawl, in: Proceedings of the 44th International ACM Conference on Research and Development in Information Retrieval, SIGIR 2021, ACM, 2021, pp. 2398–2404. URL: https://dl.acm.org/doi/10.1145/3404835.3463246.
[93] M. Fröbe, J. Bevendorff, J. Reimer, M. Potthast, M. Hagen, Sampling bias due to near-duplicates in learning to rank, in: Proceedings of the 43rd International ACM Conference on Research and Development in Information Retrieval, SIGIR 2020, ACM, 2020, pp. 1997–2000. URL: https://dl.acm.org/doi/10.1145/3397271.3401212.
[94] M. Fröbe, J. Bittner, M. Potthast, M. Hagen, The effect of content-equivalent near-duplicates on the evaluation of search engines, in: Proceedings of the 42nd European Conference on IR Research (ECIR 2020), volume 12036 of Lecture Notes in Computer Science, Springer, 2020, pp. 12–19. URL: https://link.springer.com/chapter/10.1007%2F978-3-030-45442-5_2.
[95] R. Nogueira, J. Lin, A. Epistemic, From doc2query to doctttttquery, Online preprint (2019). URL: https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf.
[96] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, MS MARCO: A human generated MAchine Reading COmprehension dataset, in: Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches 2016 at NIPS, volume 1773 of CEUR Workshop Proceedings, CEUR-WS.org, 2016. URL: http://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf.
[97] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, in: Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, OpenReview.net, 2020. URL: https://openreview.net/forum?id=H1eA7AEtvS.
[98] A. Chernodub, O. Oliynyk, P. Heidenreich, A. Bondarenko, M. Hagen, C. Biemann, A.
Panchenko, TARGER: Neural argument mining at your fingertips, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019, ACL, 2019, pp. 195–200. URL: https://doi.org/10.18653/v1/p19-3031.
[99] O. Khattab, M. Zaharia, ColBERT: efficient and effective passage search via contextualized late interaction over BERT, in: J. Huang, Y. Chang, X. Cheng, J. Kamps, V. Murdock, J. Wen, Y. Liu (Eds.), Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, ACM, 2020, pp. 39–48. URL: https://doi.org/10.1145/3397271.3401075.
[100] R. Pradeep, R. Nogueira, J. Lin, The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models, CoRR abs/2101.05667 (2021). URL: https://arxiv.org/abs/2101.05667. arXiv:2101.05667.
[101] A. Rana, P. Golchha, R. Juntunen, A. Coajă, A. Elzamarany, C.-C. Hung, S. P. Ponzetto, LeviRank: Limited query expansion with voting integration for document retrieval and ranking, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022.
[102] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python, O’Reilly, 2009. URL: http://www.oreilly.de/catalog/9780596516499/index.html.
[103] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[104] A. Williams, N. Nangia, S. R. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, Association for Computational Linguistics, 2018, pp. 1112–1122. URL: https://doi.org/10.18653/v1/n18-1101.
[105] M. Aba, M. Azra, M. Gallo, O. Mohammad, I. Piacere, G.
Virginio, N. Ferro, Aldo Nadi at Touché 2022: Argument retrieval for comparative question, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022. [106] M. F. Porter, Snowball: A language for stemming algorithms, 2001. URL: http://snowball. tartarus.org/texts/introduction.html. [107] J. Rocchio, Relevance feedback in information retrieval, The Smart retrieval system- experiments in automatic document processing (1971) 313–323. [108] G. V. Cormack, C. L. A. Clarke, S. Büttcher, Reciprocal rank fusion outperforms condorcet and individual rank learning methods, in: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, ACM, 2009, pp. 758–759. URL: https://doi.org/10.1145/1571941.1572114. [109] V. Chekalina, A. Panchenko, Retrieving comparative arguments using deep language models, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022. [110] A. Chimetto, D. Peressoni, E. Sabbatini, G. Tommasin, M. Varotto, A. Zanardelli, N. Ferro, SEUPD@CLEF: Team Hextech on argument retrieval for comparative questions. the importance of adjectives in documents quality evaluation, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022. [111] J. B. Lovins, Development of a stemming algorithm, Mech. Transl. Comput. Linguistics 11 (1968) 22–31. URL: http://www.mt-archive.info/MT-1968-Lovins.pdf. [112] P. Rajula, C.-C. Hung, S. P. Ponzetto, Stacked model based argument extraction and stance detection using embedded LSTM model, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022. [113] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of BERT: smaller, faster, cheaper and lighter, CoRR abs/1910.01108 (2019). URL: http://arxiv.org/abs/1910. 01108. arXiv:1910.01108. [114] A. Alhamzeh, M. Bouhaouel, E. Egyed-Zsigmond, J. 
Mitrovic, Distilbert-based ar- gumentation retrieval for answering comparative questions, in: Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, vol- ume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 2319–2330. URL: http://ceur-ws.org/Vol-2936/paper-209.pdf. [115] J. H. Reimer, J. Huck, A. Bondarenko, Grimjack at Touché 2022: Axiomatic re-ranking and query reformulation, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022. [116] A. Bondarenko, M. Fröbe, V. Kasturia, M. Völske, B. Stein, M. Hagen, Webis at TREC 2019: Decision track, in: E. Voorhees, A. Ellis (Eds.), Proceedings of the 28th International Text Retrieval Conference, TREC 2019, NIST, 2019. [117] J. Bevendorff, A. Bondarenko, M. Fröbe, S. Günther, M. Völske, B. Stein, M. Hagen, Webis at TREC 2020: Health Misinformation track, in: E. Voorhees, A. Ellis (Eds.), Proceedings of the 29th International Text Retrieval Conference, TREC 2020, NIST, 2020. [118] P. Rösner, N. Arnhold, T. Xylander, Quality-aware argument re-ranking for comparative questions, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022. [119] G. Erkan, D. R. Radev, LexRank: Graph-based lexical centrality as salience in text summarization, J. Artif. Intell. Res. 22 (2004) 457–479. URL: https://doi.org/10.1613/jair. 1523. [120] G. V. Cormack, M. D. Smucker, C. L. A. Clarke, Efficient and effective spam filtering and re-ranking for large web datasets, Inf. Retr. 14 (2011) 441–465. URL: https://doi.org/10. 1007/s10791-011-9162-z. [121] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent dirichlet allocation, J. Mach. Learn. Res. 3 (2003) 993–1022. URL: http://jmlr.org/papers/v3/blei03a.html. [122] T. K. Ho, Random decision forests, in: Third International Conference on Document Analysis and Recognition, ICDAR 1995, August 14 - 15, 1995, Montreal, Canada. Volume I, IEEE Computer Society, 1995, pp. 278–282. 
URL: https://doi.org/10.1109/ICDAR.1995. 598994. [123] D. Hovy, T. Berg-Kirkpatrick, A. Vaswani, E. Hovy, Learning whom to trust with MACE, in: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HTL 2013), Association for Computational Linguistics, Atlanta, Georgia, 2013, pp. 1120–1130. URL: https://aclanthology.org/N13-1132. [124] J. Braker, L. Heinemann, T. Schreieder, Aramis at Touché 2022: Argument detection in pictures using machine learning, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022. [125] M. Zaid, L. George, G. Al-Khafaji, Distinguishing cartoons images from real-life im- ages, International Journal of Advanced Research in Computer Science and Software Engineering 5 (2015) 91–95. [126] T. Brummerloh, M. L. Carnot, S. Lange, G. Pfänder, Boromir at Touché 2022: Combining natural language processing and machine learning techniques for image retrieval for arguments, in: Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022. [127] F. Å. Nielsen, A new ANEW: evaluation of a word list for sentiment analysis in microblogs, in: M. Rowe, M. Stankovic, A. Dadzie, M. Hardey (Eds.), Proceedings of the ESWC2011 Workshop on ’Making Sense of Microposts’, volume 718 of CEUR Workshop Proceedings, CEUR-WS.org, 2011, pp. 93–98. URL: http://ceur-ws.org/Vol-718/paper_16.pdf. A. Full Evaluation Results of Touché 2022: Argument Retrieval Table 6 Relevance results of all runs submitted to Task 1: Argument Retrieval for Controversial Questions. Reported are the mean nDCG@5 and the 95% confidence intervals. The baseline Swordsman is shown in bold. 
| Team | Run Tag | nDCG@5 Mean | nDCG@5 Low | nDCG@5 High |
|---|---|---|---|---|
| Porthos | scl_dlm_bqnc_acl_nsp | 0.742 | 0.670 | 0.807 |
| Daario Naharis | INTSEG-Letter-no_stoplist-Krovetz-Icoef-Evidence-Par | 0.683 | 0.609 | 0.755 |
| Daario Naharis | INTSEG-Run-Whitespace-Krovetz-Stoplist-Pos-Evidence-icoeff-Sep | 0.676 | 0.587 | 0.762 |
| Daario Naharis | INTSEG-Run-letter-english-2-20-no_stoplist-pos-evidence-icoef-An | 0.670 | 0.589 | 0.751 |
| Bruce Banner | Bruce-Banner_pyserinin_sparse_v3 | 0.651 | 0.573 | 0.720 |
| D’Artagnan | seupd2122-6musk-kstem-stop-shingle3 | 0.642 | 0.575 | 0.705 |
| Bruce Banner | Bruce-Banner_pyserinin_sparse_v1 | 0.641 | 0.575 | 0.705 |
| Daario Naharis | INTSEG-Whitespace-Stoplist-Krovetz-Icoef-Sep | 0.629 | 0.549 | 0.706 |
| Gamora | seupd2122-javacafe-gamoraHeuristicsOnlyQueryReductionDoubleIndex | 0.616 | 0.551 | 0.687 |
| D’Artagnan | seupd2122-6musk-stop-wordnet-kstem-dirichlet | 0.608 | 0.542 | 0.682 |
| D’Artagnan | seupd2122-6musk-stop-kstem-concsearch | 0.591 | 0.514 | 0.667 |
| Hit Girl | Io | 0.588 | 0.515 | 0.657 |
| Gamora | seupd2122-javacafe-gamoraStandardDoubleIndex | 0.588 | 0.518 | 0.655 |
| Bruce Banner | Bruce-Banner_pyserinin_sparse_v4 | 0.586 | 0.509 | 0.658 |
| D’Artagnan | seupd2122-6musk-word2vec-sentences-kstem | 0.585 | 0.506 | 0.661 |
| Gamora | seupd2122-javacafe-gamoraHeuristicsDoubleIndex | 0.584 | 0.512 | 0.656 |
| Hit Girl | Ganymede | 0.583 | 0.513 | 0.650 |
| Bruce Banner | Bruce-Banner_pyserinin_sparse_v2 | 0.580 | 0.507 | 0.654 |
| Hit Girl | Jupiter | 0.560 | 0.484 | 0.631 |
| Hit Girl | Europa | 0.546 | 0.477 | 0.615 |
| Gamora | seupd2122-javacafe-gamora_tfidf_kstemstopengpos_multi_YYY | 0.516 | 0.446 | 0.581 |
| Gamora | seupd2122-javacafe-gamora_sbert_kstemstopengpos_multi_YYY | 0.497 | 0.407 | 0.586 |
| Pearl | PearlBlocklist_WeightedRelevance | 0.481 | 0.399 | 0.560 |
| Pearl | PearlArgRank8040_WeightedRelevance | 0.479 | 0.403 | 0.556 |
| Pearl | PearlArgRank7530 | 0.470 | 0.391 | 0.547 |
| Pearl | PearlBlocklist | 0.466 | 0.380 | 0.549 |
| Pearl | PearlArgRank8040 | 0.465 | 0.389 | 0.551 |
| Gorgon | GorgonA2Bm25 | 0.408 | 0.354 | 0.461 |
| Daario Naharis | INTSEG-Run-Whitespace-Porter-Wordnet-Pos-no_stoplist-tfidf-An | 0.406 | 0.305 | 0.510 |
| General Grievous | seupd2122-lgtm_QE_NRR | 0.403 | 0.335 | 0.471 |
| General Grievous | seupd2122-lgtm_NQE_NRR | 0.402 | 0.335 | 0.476 |
| Gorgon | GorgonA1Bm25 | 0.396 | 0.350 | 0.442 |
| Gorgon | GorgonBasicBM25 | 0.387 | 0.330 | 0.439 |
| General Grievous | seupd2122-lgtm_NQE_NRR_ONLY_TITLE | 0.386 | 0.314 | 0.451 |
| General Grievous | seupd2122-lgtm_QE_NRR_ONLY_TITLE | 0.386 | 0.317 | 0.450 |
| Gorgon | GorgonKEBM25 | 0.378 | 0.329 | 0.428 |
| **Swordsman** | **baseline_swordsman** | **0.356** | **0.296** | **0.412** |
| Gorgon | GorgonBasicLMD | 0.315 | 0.269 | 0.362 |
| D’Artagnan | seupd2122-6musk-stop-kstem-basic | 0.300 | 0.229 | 0.369 |
| Korg | korg9000 | 0.252 | 0.187 | 0.318 |
| Porthos | scl_dlm_bqnc_acl_nsp_100_test | 0.244 | 0.215 | 0.275 |

Table 7
Quality results of all runs submitted to Task 1: Argument Retrieval for Controversial Questions. Reported are the mean nDCG@5 and the 95% confidence intervals. The baseline Swordsman is shown in bold.

| Team | Run Tag | nDCG@5 Mean | nDCG@5 Low | nDCG@5 High |
|---|---|---|---|---|
| Daario Naharis | INTSEG-Letter-no_stoplist-Krovetz-Icoef-Evidence-Par | 0.913 | 0.870 | 0.947 |
| Daario Naharis | INTSEG-Run-letter-english-2-20-no_stoplist-pos-evidence-icoef-An | 0.898 | 0.855 | 0.941 |
| Daario Naharis | INTSEG-Run-Whitespace-Krovetz-Stoplist-Pos-Evidence-icoeff-Sep | 0.896 | 0.841 | 0.944 |
| Porthos | scl_dlm_bqnc_acl_nsp | 0.873 | 0.825 | 0.913 |
| Gamora | seupd2122-javacafe-gamoraHeuristicsOnlyQueryReductionDoubleIndex | 0.785 | 0.729 | 0.848 |
| Gamora | seupd2122-javacafe-gamoraHeuristicsDoubleIndex | 0.779 | 0.716 | 0.835 |
| Daario Naharis | INTSEG-Whitespace-Stoplist-Krovetz-Icoef-Sep | 0.776 | 0.712 | 0.839 |
| Hit Girl | Ganymede | 0.776 | 0.707 | 0.840 |
| Bruce Banner | Bruce-Banner_pyserinin_sparse_v1 | 0.772 | 0.702 | 0.830 |
| Bruce Banner | Bruce-Banner_pyserinin_sparse_v3 | 0.760 | 0.680 | 0.832 |
| Gamora | seupd2122-javacafe-gamora_tfidf_kstemstopengpos_multi_YYY | 0.755 | 0.686 | 0.823 |
| Gamora | seupd2122-javacafe-gamora_sbert_kstemstopengpos_multi_YYY | 0.743 | 0.656 | 0.823 |
| Gorgon | GorgonA2Bm25 | 0.742 | 0.700 | 0.786 |
| D’Artagnan | seupd2122-6musk-stop-wordnet-kstem-dirichlet | 0.733 | 0.676 | 0.787 |
| Gamora | seupd2122-javacafe-gamoraStandardDoubleIndex | 0.731 | 0.672 | 0.786 |
| Gorgon | GorgonA1Bm25 | 0.729 | 0.686 | 0.774 |
| D’Artagnan | seupd2122-6musk-kstem-stop-shingle3 | 0.728 | 0.657 | 0.794 |
| D’Artagnan | seupd2122-6musk-stop-kstem-concsearch | 0.727 | 0.659 | 0.786 |
| Hit Girl | Jupiter | 0.725 | 0.651 | 0.796 |
| Gorgon | GorgonKEBM25 | 0.724 | 0.677 | 0.769 |
| D’Artagnan | seupd2122-6musk-word2vec-sentences-kstem | 0.723 | 0.659 | 0.787 |
| Hit Girl | Europa | 0.721 | 0.643 | 0.793 |
| Hit Girl | Io | 0.719 | 0.643 | 0.797 |
| Bruce Banner | Bruce-Banner_pyserinin_sparse_v4 | 0.709 | 0.624 | 0.783 |
| Bruce Banner | Bruce-Banner_pyserinin_sparse_v2 | 0.701 | 0.610 | 0.783 |
| Gorgon | GorgonBasicBM25 | 0.685 | 0.634 | 0.734 |
| Gorgon | GorgonBasicLMD | 0.679 | 0.634 | 0.726 |
| Pearl | PearlArgRank7530 | 0.678 | 0.609 | 0.744 |
| Daario Naharis | INTSEG-Run-Whitespace-Porter-Wordnet-Pos-no_stoplist-tfidf-An | 0.671 | 0.585 | 0.753 |
| Pearl | PearlBlocklist_WeightedRelevance | 0.670 | 0.605 | 0.734 |
| Pearl | PearlBlocklist | 0.670 | 0.605 | 0.729 |
| Pearl | PearlArgRank8040_WeightedRelevance | 0.670 | 0.601 | 0.735 |
| Pearl | PearlArgRank8040 | 0.668 | 0.595 | 0.737 |
| **Swordsman** | **baseline_swordsman** | **0.608** | **0.543** | **0.671** |
| General Grievous | seupd2122-lgtm_QE_NRR | 0.517 | 0.444 | 0.583 |
| General Grievous | seupd2122-lgtm_NQE_NRR | 0.517 | 0.442 | 0.591 |
| General Grievous | seupd2122-lgtm_NQE_NRR_ONLY_TITLE | 0.475 | 0.387 | 0.559 |
| General Grievous | seupd2122-lgtm_QE_NRR_ONLY_TITLE | 0.475 | 0.392 | 0.555 |
| Korg | korg9000 | 0.453 | 0.384 | 0.529 |
| D’Artagnan | seupd2122-6musk-stop-kstem-basic | 0.441 | 0.357 | 0.517 |
| Porthos | scl_dlm_bqnc_acl_nsp_100_test | 0.274 | 0.247 | 0.301 |

Table 8
Coherence results of all runs submitted to Task 1: Argument Retrieval for Controversial Questions. Reported are the mean nDCG@5 and the 95% confidence intervals. The baseline Swordsman is shown in bold.
| Team | Run Tag | nDCG@5 Mean | nDCG@5 Low | nDCG@5 High |
|---|---|---|---|---|
| Daario Naharis | INTSEG-Run-Whitespace-Krovetz-Stoplist-Pos-Evidence-icoeff-Sep | 0.458 | 0.389 | 0.525 |
| Daario Naharis | INTSEG-Letter-no_stoplist-Krovetz-Icoef-Evidence-Par | 0.444 | 0.375 | 0.508 |
| Porthos | scl_dlm_bqnc_acl_nsp | 0.429 | 0.353 | 0.509 |
| Daario Naharis | INTSEG-Run-letter-english-2-20-no_stoplist-pos-evidence-icoef-An | 0.407 | 0.331 | 0.489 |
| Pearl | PearlArgRank7530 | 0.398 | 0.311 | 0.485 |
| Pearl | PearlArgRank8040 | 0.396 | 0.311 | 0.481 |
| Pearl | PearlBlocklist | 0.392 | 0.307 | 0.475 |
| D’Artagnan | seupd2122-6musk-kstem-stop-shingle3 | 0.378 | 0.311 | 0.452 |
| Bruce Banner | Bruce-Banner_pyserinin_sparse_v1 | 0.378 | 0.300 | 0.459 |
| Hit Girl | Ganymede | 0.377 | 0.303 | 0.456 |
| Pearl | PearlBlocklist_WeightedRelevance | 0.369 | 0.287 | 0.450 |
| Pearl | PearlArgRank8040_WeightedRelevance | 0.369 | 0.291 | 0.443 |
| Hit Girl | Io | 0.365 | 0.302 | 0.430 |
| D’Artagnan | seupd2122-6musk-stop-wordnet-kstem-dirichlet | 0.358 | 0.292 | 0.427 |
| Bruce Banner | Bruce-Banner_pyserinin_sparse_v4 | 0.357 | 0.273 | 0.446 |
| Bruce Banner | Bruce-Banner_pyserinin_sparse_v3 | 0.354 | 0.272 | 0.444 |
| Bruce Banner | Bruce-Banner_pyserinin_sparse_v2 | 0.353 | 0.283 | 0.433 |
| Hit Girl | Europa | 0.349 | 0.287 | 0.415 |
| D’Artagnan | seupd2122-6musk-stop-kstem-concsearch | 0.336 | 0.270 | 0.400 |
| D’Artagnan | seupd2122-6musk-word2vec-sentences-kstem | 0.333 | 0.274 | 0.400 |
| Hit Girl | Jupiter | 0.330 | 0.269 | 0.394 |
| Daario Naharis | INTSEG-Whitespace-Stoplist-Krovetz-Icoef-Sep | 0.288 | 0.216 | 0.361 |
| Gamora | seupd2122-javacafe-gamora_sbert_kstemstopengpos_multi_YYY | 0.285 | 0.203 | 0.373 |
| Gorgon | GorgonKEBM25 | 0.282 | 0.233 | 0.335 |
| Gamora | seupd2122-javacafe-gamoraHeuristicsOnlyQueryReductionDoubleIndex | 0.276 | 0.204 | 0.347 |
| Gamora | seupd2122-javacafe-gamora_tfidf_kstemstopengpos_multi_YYY | 0.276 | 0.200 | 0.372 |
| Gorgon | GorgonBasicBM25 | 0.274 | 0.210 | 0.334 |
| D’Artagnan | seupd2122-6musk-stop-kstem-basic | 0.273 | 0.207 | 0.346 |
| Gamora | seupd2122-javacafe-gamoraHeuristicsDoubleIndex | 0.272 | 0.203 | 0.343 |
| Gorgon | GorgonA2Bm25 | 0.259 | 0.209 | 0.314 |
| **Swordsman** | **baseline_swordsman** | **0.248** | **0.193** | **0.303** |
| Gorgon | GorgonA1Bm25 | 0.246 | 0.197 | 0.301 |
| General Grievous | seupd2122-lgtm_QE_NRR | 0.231 | 0.162 | 0.313 |
| General Grievous | seupd2122-lgtm_NQE_NRR | 0.228 | 0.164 | 0.299 |
| Gorgon | GorgonBasicLMD | 0.225 | 0.162 | 0.289 |
| General Grievous | seupd2122-lgtm_NQE_NRR_ONLY_TITLE | 0.220 | 0.158 | 0.283 |
| General Grievous | seupd2122-lgtm_QE_NRR_ONLY_TITLE | 0.219 | 0.160 | 0.283 |
| Daario Naharis | INTSEG-Run-Whitespace-Porter-Wordnet-Pos-no_stoplist-tfidf-An | 0.203 | 0.137 | 0.280 |
| Gamora | seupd2122-javacafe-gamoraStandardDoubleIndex | 0.195 | 0.139 | 0.252 |
| Korg | korg9000 | 0.168 | 0.117 | 0.223 |
| Porthos | scl_dlm_bqnc_acl_nsp_100_test | 0.105 | 0.070 | 0.144 |

Table 9
Relevance results of all runs submitted to Task 2: Argument Retrieval for Comparative Questions. Reported are the mean nDCG@5 and the 95% confidence intervals. The baseline Puss in Boots is shown in bold.

| Team | Run Tag | nDCG@5 Mean | nDCG@5 Low | nDCG@5 High |
|---|---|---|---|---|
| Captain Levi | levirank_dense_initial_retrieval | 0.758 | 0.708 | 0.805 |
| Captain Levi | levirank_baseline_large_duo_t5 | 0.755 | 0.711 | 0.805 |
| Captain Levi | levirank_psuedo_relevance_feedback+voting | 0.753 | 0.713 | 0.797 |
| Captain Levi | levirank_voting_retrieval | 0.727 | 0.674 | 0.779 |
| Captain Levi | levirank_psuedo_relevance_feedback | 0.722 | 0.663 | 0.777 |
| Aldo Nadi | seupd2122-kueri_rrf_reranked | 0.709 | 0.648 | 0.766 |
| Aldo Nadi | seupd2122-kueri_RF_reranked | 0.695 | 0.629 | 0.756 |
| Aldo Nadi | seupd2122-kueri_rrf | 0.668 | 0.591 | 0.744 |
| Aldo Nadi | seupd2122-kueri_[...]_porter_reranked | 0.636 | 0.568 | 0.701 |
| Katana | Colbert edinburg | 0.618 | 0.553 | 0.678 |
| Katana | Colbert trained by me | 0.601 | 0.532 | 0.674 |
| Captain Tempesta | hextech_run_1 | 0.574 | 0.499 | 0.641 |
| Captain Tempesta | hextech_run_2 | 0.569 | 0.499 | 0.633 |
| Captain Tempesta | hextech_run_3 | 0.564 | 0.488 | 0.635 |
| Katana | Colbert fine tune on touche data | 0.562 | 0.488 | 0.630 |
| Captain Tempesta | hextech_run_5 | 0.557 | 0.483 | 0.624 |
| Aldo Nadi | seupd2122-kueri_[...]_porter | 0.546 | 0.473 | 0.620 |
| Captain Tempesta | hextech_run_4 | 0.536 | 0.460 | 0.609 |
| Olivier Armstrong | tfid_arg_similarity | 0.492 | 0.422 | 0.564 |
| **Puss in Boots** | **BM25-Baseline** | **0.469** | **0.403** | **0.535** |
| Grimjack | grimjack-fair-reranking-argumentative-axioms | 0.422 | 0.349 | 0.500 |
| Grimjack | grimjack-argumentative-axioms | 0.376 | 0.299 | 0.455 |
| Grimjack | grimjack-baseline | 0.376 | 0.301 | 0.459 |
| Grimjack | grimjack-fair-argumentative-reranking-with-t0 | 0.349 | 0.270 | 0.425 |
| Grimjack | grimjack-all-you-need-is-t0 | 0.345 | 0.273 | 0.425 |
| Asuna | asuna-run-5 | 0.263 | 0.198 | 0.328 |

Table 10
Quality results of all runs submitted to Task 2: Argument Retrieval for Comparative Questions. Reported are the mean nDCG@5 and the 95% confidence intervals. The baseline Puss in Boots is shown in bold.

| Team | Run Tag | nDCG@5 Mean | nDCG@5 Low | nDCG@5 High |
|---|---|---|---|---|
| Aldo Nadi | seupd2122-kueri_RF_reranked | 0.774 | 0.717 | 0.829 |
| Aldo Nadi | seupd2122-kueri_[...]_porter_reranked | 0.764 | 0.701 | 0.823 |
| Aldo Nadi | seupd2122-kueri_rrf_reranked | 0.748 | 0.687 | 0.807 |
| Captain Levi | levirank_dense_initial_retrieval | 0.744 | 0.694 | 0.804 |
| Captain Levi | levirank_baseline_large_duo_t5 | 0.742 | 0.681 | 0.800 |
| Captain Levi | levirank_psuedo_relevance_feedback+voting | 0.730 | 0.672 | 0.789 |
| Captain Levi | levirank_voting_retrieval | 0.706 | 0.639 | 0.774 |
| Captain Levi | levirank_psuedo_relevance_feedback | 0.695 | 0.625 | 0.753 |
| Aldo Nadi | seupd2122-kueri_rrf | 0.664 | 0.589 | 0.735 |
| Katana | Colbert trained by me | 0.644 | 0.574 | 0.714 |
| Katana | Colbert edinburg | 0.643 | 0.577 | 0.709 |
| Katana | Colbert fine tune on touche data | 0.637 | 0.556 | 0.718 |
| Captain Tempesta | hextech_run_5 | 0.597 | 0.521 | 0.676 |
| Captain Tempesta | hextech_run_2 | 0.593 | 0.518 | 0.670 |
| Captain Tempesta | hextech_run_1 | 0.589 | 0.508 | 0.667 |
| Captain Tempesta | hextech_run_3 | 0.584 | 0.506 | 0.660 |
| Olivier Armstrong | tfid_arg_similarity | 0.582 | 0.502 | 0.656 |
| Aldo Nadi | seupd2122-kueri_[...]_porter | 0.570 | 0.490 | 0.647 |
| Captain Tempesta | hextech_run_4 | 0.566 | 0.490 | 0.641 |
| **Puss in Boots** | **BM25-Baseline** | **0.476** | **0.400** | **0.553** |
| Grimjack | grimjack-fair-reranking-argumentative-axioms | 0.403 | 0.331 | 0.478 |
| Grimjack | grimjack-fair-argumentative-reranking-with-t0 | 0.365 | 0.290 | 0.445 |
| Grimjack | grimjack-argumentative-axioms | 0.363 | 0.289 | 0.442 |
| Grimjack | grimjack-baseline | 0.363 | 0.287 | 0.443 |
| Grimjack | grimjack-all-you-need-is-t0 | 0.344 | 0.266 | 0.428 |
| Asuna | asuna-run-5 | 0.332 | 0.254 | 0.417 |

Table 11
Stance detection results of all runs submitted to Task 2: Argument Retrieval for Comparative Questions. Reported are the macro-averaged F1 per run and per team, and the number of documents N for which the stance was predicted. The Puss in Boots baseline, which always predicts ‘no stance’, is shown in bold.

| Team | Run Tag | F1 run | N run | F1 team | N team |
|---|---|---|---|---|---|
| Grimjack | grimjack-all-you-need-is-t0 | 0.313 | 1208 | 0.235 | 1386 |
| Captain Levi | levirank_dense_initial_retrieval | 0.301 | 1688 | 0.261 | 2020 |
| Captain Levi | levirank_baseline_large_duo_t5 | 0.295 | 1960 | 0.261 | 2020 |
| Captain Levi | levirank_psuedo_relevance_feedback | 0.246 | 1948 | 0.261 | 2020 |
| Captain Levi | levirank_voting_retrieval | 0.236 | 1897 | 0.261 | 2020 |
| Katana | Colbert edinburg | 0.229 | 1027 | 0.220 | 1301 |
| Katana | Colbert trained by me | 0.221 | 1079 | 0.220 | 1301 |
| Captain Levi | levirank_psuedo_relevance_feedback+voting | 0.218 | 1822 | 0.261 | 2020 |
| Katana | Colbert fine tune on touche data | 0.212 | 940 | 0.220 | 1301 |
| Grimjack | grimjack-argumentative-axioms | 0.207 | 1282 | 0.235 | 1386 |
| Grimjack | grimjack-baseline | 0.207 | 1282 | 0.235 | 1386 |
| Grimjack | grimjack-fair-reranking-argumentative-axioms | 0.207 | 1282 | 0.235 | 1386 |
| Grimjack | grimjack-fair-argumentative-reranking-with-t0 | 0.199 | 1180 | 0.235 | 1386 |
| Olivier Armstrong | tfid_arg_similarity | 0.191 | 551 | 0.191 | 551 |
| **Puss in Boots** | **Always-NO-Baseline** | **0.158** | **1328** | **0.158** | **1328** |
| Asuna | asuna-run-5 | 0.106 | 578 | 0.106 | 578 |

Table 12
Results of all runs submitted to Task 3: Image Retrieval. Reported are the mean precision@10 (per stance) for topic relevance, argumentativeness, and stance relevance and the 95% confidence intervals (low and high).
Results for the baseline are shown in bold.

Precision@10 (Topic: topic relevance; Arg.: argumentativeness; Stance: stance relevance)

| Team | Run | Topic Mean | Topic Low | Topic High | Arg. Mean | Arg. Low | Arg. High | Stance Mean | Stance Low | Stance High |
|---|---|---|---|---|---|---|---|---|---|---|
| Boromir | BERT, OCR, query-processing | 0.878 | 0.847 | 0.904 | 0.768 | 0.733 | 0.799 | 0.425 | 0.398 | 0.451 |
| Boromir | BERT, OCR, clustering, query-processing | 0.822 | 0.782 | 0.863 | 0.728 | 0.685 | 0.772 | 0.411 | 0.383 | 0.442 |
| Boromir | AFINN, OCR | 0.814 | 0.774 | 0.851 | 0.726 | 0.680 | 0.768 | 0.408 | 0.379 | 0.436 |
| **Minsc** | **Baseline** | **0.736** | **0.693** | **0.774** | **0.686** | **0.638** | **0.734** | **0.407** | **0.367** | **0.445** |
| Boromir | AFINN, OCR, clustering | 0.749 | 0.705 | 0.792 | 0.674 | 0.625 | 0.721 | 0.384 | 0.354 | 0.414 |
| Boromir | AFINN, OCR, clustering, query-processing | 0.767 | 0.722 | 0.812 | 0.688 | 0.645 | 0.734 | 0.382 | 0.352 | 0.412 |
| Aramis | Argumentativeness:formula, stance:formula | 0.701 | 0.658 | 0.744 | 0.634 | 0.594 | 0.674 | 0.381 | 0.349 | 0.412 |
| Aramis | Argumentativeness:neural, stance:formula | 0.687 | 0.640 | 0.732 | 0.632 | 0.587 | 0.674 | 0.365 | 0.332 | 0.395 |
| Aramis | Argumentativeness:neural, stance:neural | 0.673 | 0.629 | 0.717 | 0.624 | 0.583 | 0.666 | 0.354 | 0.320 | 0.385 |
| Jester | With emotion detection | 0.696 | 0.654 | 0.736 | 0.647 | 0.601 | 0.688 | 0.350 | 0.316 | 0.382 |
| Aramis | Argumentativeness:formula, stance:neural | 0.664 | 0.622 | 0.710 | 0.609 | 0.568 | 0.646 | 0.344 | 0.317 | 0.371 |
| Jester | Without emotion detection | 0.671 | 0.635 | 0.712 | 0.618 | 0.577 | 0.656 | 0.336 | 0.308 | 0.366 |
| Boromir | AFINN, clustering | 0.600 | 0.549 | 0.649 | 0.545 | 0.495 | 0.595 | 0.319 | 0.285 | 0.351 |
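For readers who want to see how numbers of the kind reported in Tables 6 to 12 can be obtained, the following is a minimal sketch, not the lab's evaluation code. It computes nDCG@5 and precision@10 per topic and then a mean with a 95% interval. The interval uses a percentile bootstrap over topics, which is an assumption made here for illustration; the table captions do not state how the intervals were derived. All function names and the toy data are hypothetical.

```python
import math
import random
from statistics import mean


def dcg(gains, k):
    """Discounted cumulative gain of the first k gains; rank r is discounted by log2(r + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains, judged_gains, k=5):
    """nDCG@k: DCG of the run's ranking divided by the DCG of the ideal ordering
    of all judged gains for the topic (0.0 if no judged item has a positive gain)."""
    ideal = dcg(sorted(judged_gains, reverse=True), k)
    return dcg(ranked_gains, k) / ideal if ideal > 0 else 0.0


def precision_at_k(ranked_labels, k=10):
    """Fraction of the top-k results labeled positive; missing ranks count as misses."""
    return sum(1 for label in ranked_labels[:k] if label) / k


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Mean of per-topic scores plus a percentile-bootstrap (1 - alpha) interval.
    ASSUMPTION: topics are resampled with replacement; the paper's method may differ."""
    rng = random.Random(seed)
    means = sorted(mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples))
    low = means[int(alpha / 2 * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), low, high


# Hypothetical toy data: three topics, each with the graded gains of the run's
# top-5 results and the gains of all judged items for that topic.
toy_runs = [
    ([2, 1, 0, 0, 0], [2, 1, 0, 0, 0]),
    ([1, 2, 0, 0, 0], [2, 1, 0]),
    ([0, 0, 0, 0, 0], [2, 1]),
]
per_topic = [ndcg_at_k(ranked, judged) for ranked, judged in toy_runs]
mean_ndcg, ci_low, ci_high = bootstrap_ci(per_topic)
print(f"nDCG@5 mean={mean_ndcg:.3f} low={ci_low:.3f} high={ci_high:.3f}")
```

The same `bootstrap_ci` helper can be applied to per-topic `precision_at_k` values to obtain the Mean, Low, and High columns of a precision@10 table. With only a handful of topics, as in the toy data, the interval is very wide, which is why such intervals are only meaningful over a full topic set.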