Grimjack at Touché 2022: Axiomatic Re-ranking and Query Reformulation
Notebook for the Touché Lab on Argument Retrieval at CLEF 2022

Jan Heinrich Reimer, Johannes Huck and Alexander Bondarenko
Martin-Luther-Universität Halle-Wittenberg, 06099 Halle (Saale), Germany

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
Email: jan.reimer@student.uni-halle.de (J. H. Reimer); johannes.huck@student.uni-halle.de (J. Huck); alexander.bondarenko@informatik.uni-halle.de (A. Bondarenko)
URL: https://heinrich.reimer.family (J. H. Reimer); https://github.com/johannes-huck (J. Huck); https://sites.google.com/view/alexanderbondarenko (A. Bondarenko)
ORCID: 0000-0003-1992-8696 (J. H. Reimer); 0000-0002-1678-0094 (A. Bondarenko)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
In this paper, we present Team Grimjack's retrieval approaches for the Touché shared task on Argument Retrieval for Comparative Questions. In total, we submit five runs that pursue two main objectives: favoring argumentative documents of high argument quality in the final ranking and balancing stance-based exposure by ensuring an even ratio of pro and con arguments at the top ranks. Our results indicate that BM25 outperforms query likelihood ranking for initial passage retrieval and that stance-based re-ranking can slightly improve ranking effectiveness. For stance classification, prompting the T0 zero-shot language model is the best-performing approach when considering all available ground-truth labels.

Keywords
Axiomatic Re-ranking, Query Reformulation, Comparative Questions, Argument Quality, Argument Stance

1. Introduction

Argument retrieval is a specific task that not only considers the topical relevance of retrieved documents to given queries (usually of a controversial, argumentative, or opinionated nature) but also accounts for argument-specific features like argument quality and stance [1, 2]. Furthermore, it has been shown that current search engines might return biased results [3] and that argument retrieval systems return imbalanced pro / con arguments [4]. We especially emphasize the importance of retrieving diverse results for comparative questions (e.g., “Train or plane? Which is the better choice?”) that provide different points of view to mitigate biasing users’ decisions towards one or the other comparison option.

Our Team Grimjack participated in the Touché shared task on Argument Retrieval for Comparative Questions, whose goals are: (1) to retrieve relevant and high-quality argumentative passages from a collection of 868 655 text passages for a set of 50 search topics, and (2) to classify the stance of the retrieved passages towards the comparison objects in the search topics [5]. As part of our participation in the task, we have developed a flexible retrieval pipeline in Python based on Pyserini [6] as an easily configurable command line application, which we release under a free open source license (https://github.com/heinrichreimer/grimjack). In the first step, our approach uses query (comparative questions from topics’ titles) reformulation and expansion with important terms from the topics’ descriptions and narratives.
Then the top-10 passages initially retrieved using query likelihood with Dirichlet smoothing [7] are axiomatically re-ranked based on the number and position of premises, claims (identified with TARGER [8]), and comparison objects, and on argument quality predictions by the IBM Debater API [9] and T0++ [10]. Finally, the pro and con argumentative passages towards the compared objects are balanced in the final ranking by alternating documents of different stance (cf. Section 3 for more details on the approach and the submitted runs). We also submitted our software via the TIRA platform [11] (https://tira.io), which automatically evaluates submitted approaches and presents the results on a leaderboard. Even though none of our runs (with query likelihood first-stage retrieval) outperform the official BM25 baseline in terms of relevance and rhetorical quality, we observe that stance-based re-ranking can slightly improve ranking effectiveness, while axiomatic argument re-ranking with KwikSort does not change retrieval effectiveness. Our runs using query expansion with the T0++ language model [10] may serve as examples for discussing current doubts about the usefulness of large zero-shot language models in search and information retrieval [3], as they are amongst the worst-performing runs. For stance classification, however, our T0-based approach using zero-shot prompts yields promising results, even though we are unable to directly compare it to other runs due to different test set coverage.

2. Related Work

Personal decision making often starts with formulating comparative questions like “Should I major in philosophy or psychology?” [1, 2, 5]. Short direct answers (potentially biased) [12] to such questions might be insufficient; instead, such questions require diverse opinions to provide a sufficient, balanced, and argumentative overview [1]. The Touché shared task on Argument Retrieval for Comparative Questions was proposed to evaluate retrieval approaches on a large corpus with respect to the relevance and rhetorical quality of potential answers to comparative questions that may also represent different standpoints [5, 13]. The most effective approaches at previous Touché editions [1, 2] successfully used query expansion with synonyms and antonyms [14], identified premises and claims in retrieved documents [15, 16], estimated argument quality [14], and re-ranked initially retrieved documents based on argument quality and document “comparativeness”, e.g., a ratio of comparative adjectives [17]. Inspired by the participant approaches from the previous Touché editions, we also include argument mining and argument quality estimation components in our retrieval pipeline, albeit using different methods. We rely on T0, a large language model trained in a multitask setting that has been shown to achieve state-of-the-art results for various natural language processing tasks in zero-shot settings [10]. The largest pretrained T0 variant, T0++, was trained on 62 datasets with 12 task-specific prompts covering tasks such as question answering, sentiment analysis, summarization, etc. By using T0++, we aim to answer the question of whether the abilities of large language models are sufficient for the new task of argument retrieval.

Figure 1: Architecture overview for the modular retrieval pipeline used to produce our runs. Dashed boxes indicate optional steps that are not used in all runs.
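To make the flow in Figure 1 more concrete, the following Python sketch chains the four pipeline stages as plain functions. It is a schematic illustration only: the Passage container, the stage callables, and their names are simplifications for this notebook and not the actual module interface of the released implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    """Illustrative container for one retrieved passage."""
    doc_id: str
    text: str
    score: float
    quality: float = 0.0  # filled in by the quality tagger (Section 3.3)
    stance: float = 0.0   # > 0: pro first object, < 0: pro second object


def run_pipeline(
    topic_title: str,
    expand_queries: Callable[[str], List[str]],
    retrieve: Callable[[str, int], List[Passage]],
    tag_quality: Callable[[Passage], float],
    tag_stance: Callable[[Passage], float],
    rerank_axiomatically: Callable[[List[Passage]], List[Passage]],
    rerank_by_stance: Callable[[List[Passage]], List[Passage]],
    k: int = 100,
    rerank_depth: int = 10,
) -> List[Passage]:
    """Schematic flow of the four pipeline stages shown in Figure 1."""
    # (1) Query expansion, reformulation, and combination (Section 3.1).
    queries = expand_queries(topic_title)
    combined_query = " OR ".join(queries)  # logical disjunction over all query variants

    # (2) First-stage retrieval with query likelihood (Section 3.2).
    passages = retrieve(combined_query, k)

    # (3) Argument quality estimation and stance detection (Section 3.3).
    for passage in passages:
        passage.quality = tag_quality(passage)
        passage.stance = tag_stance(passage)

    # (4) Axiomatic and stance-based re-ranking of the top results (Section 3.4).
    top, rest = passages[:rerank_depth], passages[rerank_depth:]
    return rerank_by_stance(rerank_axiomatically(top)) + rest
```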
Our second idea of axiomatic re-ranking comes from axiomatic thinking in information retrieval, where axioms formally describe constraints that a good retrieval model should fulfil, e.g., that documents with more query term occurrences should be ranked higher (Fang et al. [18]). It has already been shown that combining multiple axioms to re-rank the results of arbitrary retrieval models can improve the final overall retrieval effectiveness [19]. Complementing existing retrieval axioms, Bondarenko et al. [20] introduced argumentativeness axioms based on claims and premises in documents identified with TARGER [8].

3. Approach

We design the architecture of our argumentative retrieval system as a multi-step pipeline that subsequently (re-)ranks, annotates, or modifies the documents retrieved for each query with query likelihood with Dirichlet smoothing (𝜇 = 1 000). As shown in Figure 1, our proposed pipeline consists of four main steps: (1) query expansion, reformulation, and combination, (2) first-stage retrieval, (3) argument quality estimation and stance detection, and (4) axiomatic re-ranking and stance-based re-ranking.

3.1. Query Expansion, Reformulation, and Combination

The first step of our retrieval pipeline is the reformulation and expansion of the original queries (the task’s topic titles), which aims to increase recall. For that, we use two different strategies: (1) replacing the comparison objects with their synonyms (e.g., Ubuntu vs. Windows → Linux vs. Windows) and (2) generating additional, new queries that exploit the topics’ description and narrative provided by the task organizers [5]. We then address the precision–recall trade-off by deploying re-ranking steps that move more relevant documents to the top of the ranking (cf. Section 3.4).

Query Reformulation with Synonyms. To find synonyms of the comparison objects mentioned in questions (search queries), we use two different strategies: (1) word embeddings and (2) zero-shot generation with pre-trained large language models. For the first strategy, we use fastText word embeddings [21] from PyMagnitude (https://gitlab.com/Plasticity/magnitude) to find the word with the highest cosine similarity to the given comparison object in the embedding space. We manually examine synonyms from the fastText embeddings pre-trained on different corpora (e.g., Wikipedia and Twitter) and find that the Twitter-based embeddings provide more accurate synonyms.
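As a minimal sketch of this embedding-based strategy (assuming PyMagnitude is installed and a pre-trained fastText .magnitude file is available locally; the file name and function names below are illustrative):

```python
from pymagnitude import Magnitude


def embedding_synonym(vectors: Magnitude, comparison_object: str, top_n: int = 10) -> str:
    """Return the nearest embedding neighbour that differs from the object itself."""
    for word, _similarity in vectors.most_similar(comparison_object, topn=top_n):
        if word.lower() != comparison_object.lower():
            return word
    return comparison_object  # fall back to the original term


def reformulate_with_synonyms(query: str, first: str, second: str,
                              vectors: Magnitude) -> str:
    """Replace both comparison objects in a topic title by their nearest neighbours,
    e.g., 'Ubuntu vs. Windows' -> 'Linux vs. Windows' (if the embedding agrees)."""
    return (query
            .replace(first, embedding_synonym(vectors, first))
            .replace(second, embedding_synonym(vectors, second)))


# Pre-trained fastText vectors in Magnitude format (file name is a placeholder).
vectors = Magnitude("fasttext-twitter.magnitude")
print(reformulate_with_synonyms("Ubuntu vs. Windows", "Ubuntu", "Windows", vectors))
```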
Our second strategy is based on the T0++ zero-shot language model [10]. We prompt the model to generate an answer to the following question: What are synonyms of the word ⟨object⟩?, where ⟨object⟩ is one of the two original comparison objects. We then process the output by splitting it at commas and selecting the first term that differs from the original query term. With the synonyms returned by either strategy, we replace the comparison objects to formulate new question queries.

Query Reformulation with Topic Context. In the next step, we complement the expanded queries with two newly generated ones per topic, taking into account the contextual information from the topic’s description and narrative, which contain important details on the actual information need. Using the Hugging Face Inference API [22], we prompt T0++ with the following task: ⟨text⟩. Extract a natural search query from this description., where ⟨text⟩ is either the topic’s narrative or its description. In Table 1, we show examples of generated queries. Although some of the generated queries (e.g., for topic 53) are just reformulations of the original one, T0++ also generates potentially useful new queries (e.g., for topic 12).

Table 1: Original queries (topic titles) provided by Touché and queries generated by T0++ [10] by prompting it with the topic’s description (D) or narrative (N).

Topic 12 – Original query: Train or plane? Which is the better choice?
    D: Travel
    N: What are the benefits of trains over planes for intercontinental travel?
Topic 53 – Original query: Should I buy steel or ceramic knives?
    D: Why should I choose ceramic knives over steel knives?
    N: What are the pros and cons of ceramic knives?
Topic 88 – Original query: Should I major in philosophy or psychology?
    D: What is the difference between philosophy and psychology?
    N: What are the benefits of a major in English or history?
Topic 95 – Original query: Which is more environmentally friendly, a hybrid or a diesel?
    D: What are the most environmentally friendly cars?
    N: What is more environmentally friendly, a diesel or a hybrid car?

Query Combination and Expansion. Finally, we combine up to 5 question queries (reformulated with synonyms, generated, and the original one, depending on the submitted run; cf. Section 4) using a logical disjunction (Pyserini’s OR operator). We choose the logical disjunction with the aim of increasing the system’s recall and decreasing the chance of empty result sets in case search terms are not present in the corpus. In total, we submitted 5 runs (retrieval results; cf. Section 4) to the task; in some of them we use only the original query and in the others the expanded queries, to test the influence of query expansion and reformulation on the final ranking results.

3.2. Passage Retrieval

To retrieve passages from the task’s corpus, we first build an inverted index using the Pyserini framework [6]. In the index, we store index term positions, passage vectors, and raw passage contents. Index terms are stemmed using the Porter stemmer [23], and stop words are removed as per the default Pyserini stopword list [6]. We then retrieve passages for the previously combined query (cf. Section 3.1) using the query likelihood model with Dirichlet smoothing (𝜇 = 1 000). From this first-stage ranker, we retrieve 100 candidate passages for each query.
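As a rough sketch (not our exact configuration code), this first-stage retrieval step can be reproduced with Pyserini as follows; the index path is a placeholder, and older Pyserini versions expose the same searcher as SimpleSearcher instead of LuceneSearcher.

```python
from pyserini.search.lucene import LuceneSearcher  # SimpleSearcher in older Pyserini versions

# Inverted index built with Pyserini over the task's passage corpus (path is illustrative).
searcher = LuceneSearcher("indexes/touche22-passages")

# Query likelihood with Dirichlet smoothing, mu = 1000, as used by our first-stage ranker.
searcher.set_qld(1000.0)


def retrieve(combined_query: str, k: int = 100):
    """Retrieve the top-k candidate passages for the (possibly expanded) query."""
    hits = searcher.search(combined_query, k=k)
    return [(hit.docid, hit.score, hit.raw) for hit in hits]
```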
3.3. Argument Tagging, Argument Quality and Stance Classification

After retrieving candidate passages, we tag the argumentative structure (premises and claims), estimate argument quality, and detect the stance (whether the passage is pro the first comparison object, pro the second, neutral, or has no stance). This information is used in later steps of our retrieval pipeline for re-ranking (cf. Section 3.4). We tag each passage’s argumentative structure with the TARGER argument tagger [8] using the targer-api Python package (https://github.com/webis-de/targer-api). To estimate a passage’s argument quality and to detect its stance, we first split each passage into sentences using the NLTK library [24]. Each sentence is then treated as one potential argument; the quality score and stance for the whole passage are calculated by averaging the quality or stance scores of all sentences in the passage.

Argument Quality Estimation. We use two different methods for assessing the argument quality. Our first method is based on the IBM Debater API [9] (https://early-access-program.debater.res.ibm.com). The API determines how good the quality of each argument is with regard to the topic using a BERT-based [25] regression model trained on the IBM-ArgQ-6.3kArgs dataset. The API returns a quality score ranging from 0 (low quality) to 1 (high quality).

Table 2: Argument quality and argument stance label mappings from the textual tokens returned by T0++ [10]. (a) Argument quality label mapping for the prompt How would you rate the readability and consistency in this sentence?. (b) Argument stance label mapping for the prompts Is this sentence pro ⟨object⟩? (Pro) and Is this sentence against ⟨object⟩? (Con), given a single comparative object ⟨object⟩.

(a)  Text label    Value
     very good     1.00
     good          0.75
     bad           0.25
     very bad      0.00
     other         0.50

(b)  Pro answer    Con answer    Value
     yes / pro     yes / con      0
     yes / pro     no            +1
     no            yes / con     -1
     no            no             0
     other         other          0

As a second method to obtain the argument quality, we also use the T0++ model [10] and prompt it to generate a text for the following task: ⟨sentence⟩. How would you rate the readability and consistency in this sentence? very good, good, bad, very bad, where ⟨sentence⟩ is one of the passage’s sentences. We then map the model’s textual outputs to numeric values using the mapping shown in Table 2a.

Stance Detection. Stance detection for each sentence uses the same conceptual approaches but with different inputs and outputs. Since both the IBM Debater API [26] and T0++ [10] can predict only a single-target stance (i.e., towards one of the two comparison objects), we combine the two single-target stance scores into a multi-target stance by taking the difference between the stance towards the first object and the stance towards the second object. We also experimented with different thresholds for the minimal difference between the single-target stances and found a threshold of 0.125 to work well by manually examining some classified examples.

For scoring the single-target argument stance of a sentence with the IBM Debater API, we again query the API with a sentence (argument) and a topic created from one of the comparison objects. The classifier [26] then computes an argument’s likelihood of being pro, con, or neutral with respect to the topic (i.e., the comparison object in our pipeline) by first classifying a sentiment and then detecting whether the topic’s and the argument’s targets contradict each other. The API returns a score from -1 (against the comparison object) to +1 (in favor). By classifying different topics for each object (i.e., ⟨object⟩ is good and ⟨object⟩ is the best), we determine an averaged single-target stance for each comparison object.

When using T0++ for stance detection, we first experiment with directly prompting the model to output ‘pro’, ‘con’, or ‘neutral’ labels for the comparison objects. We formulate the task as two simple questions passed to the model, one to determine whether the sentence has a positive stance towards the comparison object and one to determine whether it has a negative stance: ⟨sentence⟩ Is this sentence pro ⟨object⟩? yes or no and ⟨sentence⟩ Is this sentence against ⟨object⟩? yes or no, where ⟨sentence⟩ is one sentence of the passage and ⟨object⟩ is one of the comparison objects. This results in two answers (yes or no) for the positive and the negative stance, respectively. We combine the two textual answers using the mapping shown in Table 2b.
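A minimal sketch of the mappings in Table 2 and of the multi-target stance combination with the 0.125 threshold; the function names are illustrative, and unexpected model outputs are treated according to the ‘other’ rows of the table.

```python
# Table 2a: map a generated quality label to a score in [0, 1]; 0.5 for anything else.
QUALITY_LABELS = {"very good": 1.00, "good": 0.75, "bad": 0.25, "very bad": 0.00}


def quality_score(t0_output: str) -> float:
    return QUALITY_LABELS.get(t0_output.strip().lower(), 0.50)


def single_target_stance(pro_answer: str, con_answer: str) -> int:
    """Table 2b: combine the answers to the 'pro' and 'against' prompts."""
    pro = pro_answer.strip().lower() in ("yes", "pro")
    con = con_answer.strip().lower() in ("yes", "con")
    if pro and not con:
        return +1  # in favor of the comparison object
    if con and not pro:
        return -1  # against the comparison object
    return 0       # contradictory, negative, or unexpected answers


def multi_target_stance(stance_first: float, stance_second: float,
                        threshold: float = 0.125) -> str:
    """Combine two single-target stance scores into a multi-target stance (Section 3.3)."""
    difference = stance_first - stance_second
    if difference > threshold:
        return "pro first object"
    if difference < -threshold:
        return "pro second object"
    return "neutral"
```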
3.4. Axiomatic and Stance-based Re-rankers

Since the recall of our retrieval system is increased by expanding and reformulating queries (cf. Section 3.1), we seek to improve precision by re-ranking the top-10 passages from the first-stage retrieval (cf. Section 3.2) using two different strategies that should rank more argumentative and higher-quality passages higher while also ensuring a balanced overview of the two comparison objects: (1) we re-rank based on argumentativeness axioms, and (2) we re-rank based on the passages’ stances towards the comparison objects.

Argumentative Axiomatic Re-ranking. Ranking methods such as BM25 or query likelihood with Dirichlet smoothing do not capture the “argumentativeness” of text that is important for argument retrieval [5]. Some approaches at the TREC Common Core and Decision tracks exploit task-specific argumentativeness axioms to address document argumentativeness [20, 27]. Axioms are constraints that define pairwise ranking preferences between documents or passages. Because of the promising developments in the field of axiomatic information retrieval [28], we re-rank the top-10 initially retrieved passages with the KwikSort algorithm [19]. For the axiomatic re-ranking, we compute preferences for the 7 argumentativeness axioms specified in Table 3. The axioms cover general argumentativeness (ArgUC), argumentative relevance (QTArg, QTPArg), comparative relevance (CompArg, CompPArg), and rhetorical and argumentative quality (aSLDoc, ArgQ). We then combine the axioms in a majority voting scheme, i.e., we only keep preferences where at least 50 % of the 7 axioms agree, and fall back to the original ranking order if fewer than 50 % of all axioms agree. Using the ir_axioms framework [28] (https://github.com/webis-de/ir_axioms/), we then re-rank with the combined axiom.

Table 3: Axioms used in our retrieval pipeline. An asterisk (⋆) indicates newly proposed axioms.

Name          Description
ArgUC [20]    Prefer more argumentative units.
QTArg [20]    Prefer more query terms in argumentative units.
QTPArg [20]   Prefer earlier query terms in argumentative units.
CompArg⋆      Prefer more comparative objects in argumentative units.
CompPArg⋆     Prefer earlier comparative objects in argumentative units.
aSLDoc [27]   Prefer passages with 12–20 words per sentence.
ArgQ⋆         Prefer higher argument quality.

Stance-based Re-ranking. We also implement a stance-based re-ranker to produce rankings in which the two conflicting stances (pro first comparison object and pro second comparison object) are nearly equally present. For balancing the stances, we experiment with two different re-ranking strategies: (1) alternating stance and (2) balanced top-𝑘 stance. For the alternating stance strategy, we split the result set into three lists: (1) arguments in favor of the first comparison object, (2) arguments in favor of the second comparison object, and (3) neutral arguments or arguments with no stance. We then alternately select passages from the first two lists. If one or both lists are empty, we fall back to the neutral list. The balanced top-𝑘 stance strategy is based on the original ranking. Here, we count the number of passages in favor of the first and of the second comparison object among the top-𝑘 initially retrieved passages. If the difference between these two counts is greater than 1, we move the last passage of the majority stance within the top-𝑘 ranking behind the first minority-stance passage after the top-𝑘 ranking. This way, passages of the underrepresented stance advance in the ranking until the top-𝑘 positions are balanced. In initial experiments, however, we find the alternating stance strategy to be more promising, because the balanced top-𝑘 stance strategy often leads to rankings containing mostly neutral passages; a sketch of the alternating strategy is shown below.
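The following Python sketch illustrates the alternating stance strategy; it reflects one possible reading of the fallback behavior described above (preferred stance first, then neutral passages, then the remaining stance) and is not a verbatim excerpt of our implementation.

```python
from typing import List, Tuple

# A ranked result is (doc_id, stance), where stance is "first", "second",
# or anything else for neutral and no-stance passages.
Result = Tuple[str, str]


def alternate_stance(ranking: List[Result]) -> List[Result]:
    """Interleave passages that favor the first and the second comparison object."""
    pro_first = [r for r in ranking if r[1] == "first"]
    pro_second = [r for r in ranking if r[1] == "second"]
    neutral = [r for r in ranking if r[1] not in ("first", "second")]

    reranked: List[Result] = []
    take_first = True
    while pro_first or pro_second or neutral:
        preferred = pro_first if take_first else pro_second
        fallback = pro_second if take_first else pro_first
        if preferred:
            reranked.append(preferred.pop(0))
        elif neutral:        # the preferred stance list is empty
            reranked.append(neutral.pop(0))
        else:                # only the other stance is left
            reranked.append(fallback.pop(0))
        take_first = not take_first
    return reranked
```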
4. Submitted Runs

We submit five runs that use different components and strategies of our pipeline (cf. Section 3) to the second Touché task. Instead of uploading generated run files, we deploy our retrieval system as working software on the TIRA platform [11].

Query Likelihood Baseline (Run 1). For our first run, we simply retrieve the top-100 passages ranked by query likelihood with Dirichlet smoothing [7] (𝜇 = 1 000) for the original, unmodified queries (topic titles) and tag the argument stance by comparing sentiments for each object using the IBM Debater API, treating stances below a threshold of 0.125 as neutral.

Argument Axioms (Run 2). To produce our second run, we re-rank the top-10 passages from the baseline result using KwikSort [28, 19] based on preferences from the argument axioms, as described in Section 3.4.

Stance-based Re-ranking with Argumentative Axioms (Run 3). Our third run also uses argument axiomatic re-ranking after the baseline retrieval. However, to ensure that the stances towards both comparison objects are nearly equally represented in the result ranking, we additionally apply stance-based re-ranking with the alternating stance strategy, as described in Section 3.4.

All You Need is T0 (Run 4). Large language models have recently found application in many NLP tasks, in web search, and in retrieval. The trend of using large language models to solve almost any task has also been criticized. For instance, Shah and Bender [3] highlight conceptual flaws that question whether such an extreme usage of not fully understood models is desirable when implementing search for answers to real-life questions (e.g., in search engines). In our fourth submitted run, we want to test the T0++ language model's zero-shot classification abilities. First, we reformulate, generate, and combine queries; the final queries are an expansion of the topic titles (cf. Section 3.1). We then retrieve 100 documents using query likelihood and use T0++ again to estimate argument quality and stance (cf. Section 3.3).

Argumentative Stance-based Re-ranking with T0 (Run 5). In our last run, we combine most of the methods introduced in Section 3 to generate a ranking that is both as argumentative as possible and represents argument stances equally, and that also uses T0++ for query reformulation and expansion. Here, we combine new queries generated by T0++ and reformulate queries by replacing comparison objects with synonyms returned by T0++. However, we also use synonyms from the fastText [21] embedding similarity method (cf. Section 3.1); the final queries are an expansion of the topic titles. The top-10 results of the 100 passages retrieved using query likelihood for the expanded queries are then re-ranked based on the argumentativeness axioms and by alternating stance (cf. Section 3.4).

5. Results

We evaluate our approaches by their effectiveness in retrieving relevant and high-quality passages and in predicting the correct stance towards the comparison objects, using the manual judgments provided by Touché. The task organizers asked human volunteers to label each document pooled from all submitted runs at depth 5 with respect to relevance (0: not relevant, 1: relevant, 2: highly relevant), rhetorical quality (0: low quality or not argumentative, 1: average quality, 2: high quality), and stance (pro first object, pro second object, neutral, no stance).
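For reference, the nDCG@5 scores reported below can be computed per topic roughly as in the following sketch, which uses the common graded-gain formulation with a log2 rank discount; the shared task's official evaluation script may differ in details such as the gain definition or tie handling.

```python
from math import log2
from typing import Dict, List


def dcg_at_k(gains: List[int], k: int) -> float:
    """Discounted cumulative gain over the first k graded judgments."""
    return sum(gain / log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranking: List[str], judgments: Dict[str, int], k: int = 5) -> float:
    """nDCG@k for one topic; unjudged documents count as gain 0."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranking]
    ideal_gains = sorted(judgments.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical topic with graded relevance labels 0 (not), 1 (relevant), 2 (highly relevant).
judgments = {"p1": 2, "p2": 0, "p3": 1, "p4": 2}
print(ndcg_at_k(["p3", "p1", "p5", "p4", "p2"], judgments, k=5))
```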
The results for the relevance and quality effectiveness using nDCG@5 (Tables 4 and 5) show that our baseline Run 1 using query likelihood with Dirichlet smoothing performs worse than the BM25 baseline (Puss in Boots [5]). Since our other runs re-rank the results retrieved by this initial ranking, we compare our individual re-ranking strategies against it. Nonetheless, we acknowledge that all of our submitted runs are outperformed by the BM25 baseline and by the dense rankers' results submitted to the shared task. The differences in nDCG@5 scores compared to our query likelihood baseline indicate that axiomatic re-ranking (Run 2) can increase consistency with argumentativeness axioms while retaining equal retrieval effectiveness. Unfortunately, query expansion with T0++ slightly decreases nDCG@5, on average by about 3 p.p. for relevance judgments and 2 p.p. for quality judgments. Stance-based re-ranking, however, can increase nDCG@5 by up to 5 p.p. for relevance judgments and by 4 p.p. for quality judgments. None of our re-ranking stages could sufficiently compensate for the worse retrieval performance of the initial query likelihood ranking.

Table 4: Relevance results of selected runs submitted to Task 2: Argument Retrieval for Comparative Questions. Reported are the mean nDCG@5 and the 95% confidence intervals for our runs, the best run submitted to the task (team Captain Levi), and the official task baseline (Puss in Boots).

Team                Run                   nDCG@5 mean   Low     High
Captain Levi [29]   dense_initial_retr.   0.758         0.708   0.810
Puss in Boots [5]   BM25-Baseline         0.469         0.403   0.535
Grimjack            Run 3                 0.422         0.349   0.500
Grimjack            Run 2                 0.376         0.299   0.455
Grimjack            Run 1                 0.376         0.301   0.459
Grimjack            Run 5                 0.349         0.270   0.425
Grimjack            Run 4                 0.345         0.273   0.425

Table 5: Quality results of selected runs submitted to Task 2: Argument Retrieval for Comparative Questions. Reported are the mean nDCG@5 and the 95% confidence intervals for our runs, the best run submitted to the task (team Aldo Nadi), and the official task baseline (Puss in Boots).

Team                Run             nDCG@5 mean   Low     High
Aldo Nadi [30]      RF_reranked     0.774         0.719   0.828
Puss in Boots [5]   BM25-Baseline   0.476         0.400   0.553
Grimjack            Run 3           0.403         0.331   0.478
Grimjack            Run 5           0.365         0.290   0.445
Grimjack            Run 2           0.363         0.289   0.442
Grimjack            Run 1           0.363         0.287   0.443
Grimjack            Run 4           0.344         0.266   0.428

For stance detection, we compare our T0-based stance classification approach with the best competing team's approach (Captain Levi, pre-trained RoBERTa without fine-tuning) and the baseline (Puss in Boots) that predicts the majority class ('no stance'). In Table 6, we report a macro-averaged F1-score per run and per team as well as the number of documents 𝑁 for which the predicted stance has a ground-truth label as provided by the task organizers. We observe that, since only the top-5 passages were pooled for manual judgment, only a limited number of predicted stance labels (e.g., 1 208 for Run 4) can be used for evaluation, even though we predicted the stance up to depth 100 (i.e., 5 000 predicted stance labels per run). In this setting, our Run 4 (i.e., stance prediction using T0++; cf. Section 3.3) has the highest macro-averaged F1-score of all runs submitted to the task. However, due to the limited number of labels available for evaluation, and because the number of available labels differs across teams and runs, we cannot directly compare different runs. For example, the 3 792 unjudged labels from Run 4 could be correctly predicted (i.e., increasing F1) or incorrectly predicted (i.e., decreasing F1).
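The 'All' columns in Table 6 correspond to a macro-averaged F1-score restricted to those documents that received a ground-truth stance label. Assuming scikit-learn, this can be computed roughly as in the sketch below; the organizers' exact evaluation script may differ.

```python
from typing import Dict

from sklearn.metrics import f1_score

# Four-class stance labels as used in the task's ground truth.
STANCE_LABELS = ["pro first object", "pro second object", "neutral", "no stance"]


def macro_f1_on_judged(predictions: Dict[str, str], ground_truth: Dict[str, str]) -> float:
    """Macro-averaged F1 over only the predicted labels that have a ground-truth label."""
    judged_ids = [doc_id for doc_id in predictions if doc_id in ground_truth]
    y_true = [ground_truth[doc_id] for doc_id in judged_ids]
    y_pred = [predictions[doc_id] for doc_id in judged_ids]
    return f1_score(y_true, y_pred, labels=STANCE_LABELS, average="macro", zero_division=0)
```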
As an alternative, comparable measure, in the rightmost columns of Table 6 we report the F1-scores of the predicted stances of only the top-5 passages of each run. All 250 stance labels from the top-5 results of each submitted run have corresponding ground-truth labels due to the organizers' top-5 pooling for the manual judgment. When considering only the top-5 passages, our stance classification approach using T0++ falls behind team Captain Levi's best-performing approach. However, 250 samples might also be an insufficient sample size to compare classifier performance. It is also unclear how examining only the top results affects the evaluation of classification performance.

Table 6: Stance detection results of selected runs submitted to Task 2: Argument Retrieval for Comparative Questions. Reported are a macro-averaged F1-score and the number of documents N for which the predicted stance has a ground-truth label, for our runs, the best run submitted to the task (team Captain Levi), and the official task baseline that always predicts 'no stance' (Puss in Boots). The F1-score is computed for all predicted stance labels with corresponding ground-truth labels (All) or only for the top-5 passages per run (Top-5).

Team                Run                   All: F1   All: N   Top-5: F1   Top-5: N
Grimjack            Run 4                 0.313     1 208    0.235       250
Captain Levi [29]   dense_initial_retr.   0.301     1 688    0.359       250
Grimjack            Run 2                 0.207     1 282    0.180       250
Grimjack            Run 1                 0.207     1 282    0.180       250
Grimjack            Run 3                 0.207     1 282    0.175       250
Grimjack            Run 5                 0.199     1 180    0.168       250
Puss in Boots [5]   Always-NO-Baseline    0.158     1 328    0.159       250

6. Conclusion

In our approaches to retrieving relevant and high-quality argumentative passages that help answer comparative questions, we combine query reformulation and expansion techniques with axiomatic re-ranking that exploits argumentative structure as well as argument quality and stance. Using the IBM Debater API and the T0++ language model, we showcase two state-of-the-art approaches for argument quality estimation. We extend previous query expansion approaches used in the Touché shared tasks by incorporating the contextual information provided in the topics' descriptions and narratives. To attain nearly equal exposure across argument stances in the final ranking, we balance the pro and con arguments at the top-10 ranks. While none of our runs outperform the BM25 baseline in terms of nDCG@5 on relevance and quality judgments, we find that axiomatic re-ranking and stance-based re-ranking can slightly increase the effectiveness of the first-stage query likelihood ranking. This poses an interesting direction for future work: applying our proposed re-ranking strategies to the results of other retrieval models, e.g., BM25. Since our run featuring query expansion with texts generated by T0++ is the worst-performing in terms of relevance and rhetorical quality, we also question the usefulness of large language models in early retrieval stages. Our results represent additional motivation to investigate the effect of explainability on retrieval performance, as recently questioned in the community. Our approach to stance classification heuristically maps single-target stance classification results to a multi-target stance, and we were not able to find a satisfactory strategy to distinguish a neutral stance from passages without any stance. Arguably, fine-tuning a multi-class neural classifier like BERT on the stance dataset provided by Touché could improve classification performance by directly predicting the multi-target stance.
Our evaluation of F1 stance prediction performance yields no clear winner, as the participating teams predicted stance labels for different, potentially biased subsets of the document collection, resulting in different test set coverage. We encourage future work to reproduce and evaluate the stance prediction approaches of all participating teams on an independent test dataset.

Acknowledgments

This work was partially supported by the Deutsche Forschungsgemeinschaft (DFG) through the project “ACQuA 2.0” (Answering Comparative Questions with Arguments; project number 376430233).

References

[1] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument Retrieval, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_261.pdf.
[2] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument Retrieval, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21st - to - 24th, 2021, volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 2258–2284. URL: http://ceur-ws.org/Vol-2936/paper-205.pdf.
[3] C. Shah, E. M. Bender, Situating Search, in: D. Elsweiler (Ed.), CHIIR ’22: ACM SIGIR Conference on Human Information Interaction and Retrieval, Regensburg, Germany, March 14 - 18, 2022, ACM, 2022, pp. 221–232. URL: https://doi.org/10.1145/3498366.3505816.
[4] S. P. Cherumanal, D. Spina, F. Scholer, W. B. Croft, Evaluating Fairness in Argument Retrieval, in: G. Demartini, G. Zuccon, J. S. Culpepper, Z. Huang, H. Tong (Eds.), CIKM ’21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, Queensland, Australia, November 1 - 5, 2021, ACM, 2021, pp. 3363–3367. URL: https://doi.org/10.1145/3459637.3482099.
[5] A. Bondarenko, M. Fröbe, J. Kiesel, S. Syed, T. Gurcke, M. Beloucif, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2022: Argument Retrieval, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association (CLEF 2022), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2022.
[6] J. Lin, X. Ma, S. Lin, J. Yang, R. Pradeep, R. Nogueira, Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations, CoRR abs/2102.10073 (2021). URL: https://arxiv.org/abs/2102.10073. arXiv:2102.10073.
[7] C. Zhai, J. D. Lafferty, A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval, in: W. B. Croft, D. J. Harper, D. H. Kraft, J. Zobel (Eds.), SIGIR 2001: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, September 9-13, 2001, New Orleans, Louisiana, USA, ACM, 2001, pp. 334–342. URL: https://doi.org/10.1145/383952.384019.
[8] A. N. Chernodub, O. Oliynyk, P. Heidenreich, A. Bondarenko, M. Hagen, C. Biemann, A. Panchenko, TARGER: Neural Argument Mining at Your Fingertips, in: M. R. Costa-jussà, E.
Alfonseca (Eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 3: System Demonstrations, Association for Computational Linguistics, 2019, pp. 195–200. URL: https://doi.org/10.18653/v1/p19-3031.
[9] A. Toledo, S. Gretz, E. Cohen-Karlik, R. Friedman, E. Venezian, D. Lahav, M. Jacovi, R. Aharonov, N. Slonim, Automatic Argument Quality Assessment - New Datasets and Methods, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Association for Computational Linguistics, 2019, pp. 5624–5634. URL: https://doi.org/10.18653/v1/D19-1564.
[10] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. V. Nayak, D. Datta, J. Chang, M. T. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Févry, J. A. Fries, R. Teehan, S. Biderman, L. Gao, T. Bers, T. Wolf, A. M. Rush, Multitask Prompted Training Enables Zero-Shot Task Generalization, CoRR abs/2110.08207 (2021). URL: https://arxiv.org/abs/2110.08207. arXiv:2110.08207.
[11] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF, volume 41 of The Information Retrieval Series, Springer, 2019, pp. 123–160. URL: https://doi.org/10.1007/978-3-030-22948-1_5.
[12] M. Potthast, M. Hagen, B. Stein, The Dilemma of the Direct Answer, SIGIR Forum 54 (2020) 14:1–14:12. URL: https://doi.org/10.1145/3451964.3451978.
[13] A. Bondarenko, Y. Ajjour, V. Dittmar, N. Homann, P. Braslavski, M. Hagen, Towards Understanding and Answering Comparative Questions, in: K. S. Candan, H. Liu, L. Akoglu, X. L. Dong, J. Tang (Eds.), WSDM ’22: The Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event / Tempe, AZ, USA, February 21 - 25, 2022, ACM, 2022, pp. 66–74. URL: https://doi.org/10.1145/3488560.3498534.
[14] T. Abye, T. Sager, A. J. Triebel, An Open-Domain Web Search Engine for Answering Comparative Questions, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_130.pdf.
[15] J. Huck, Development of a Search Engine to Answer Comparative Queries, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_178.pdf.
[16] E. Shirshakova, A. Wattar, Thor at Touché 2021: Argument Retrieval for Comparative Questions, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21st - to - 24th, 2021, volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 2455–2462. URL: http://ceur-ws.org/Vol-2936/paper-219.pdf.
[17] V. Chekalina, A. Panchenko, Retrieving Comparative Arguments using Ensemble Methods and BERT, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21st - to - 24th, 2021, volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 2354–2365. URL: http://ceur-ws.org/Vol-2936/paper-211.pdf.
[18] H. Fang, T. Tao, C. Zhai, A Formal Study of Information Retrieval Heuristics, in: M. Sanderson, K. Järvelin, J. Allan, P. Bruza (Eds.), SIGIR 2004: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, July 25-29, 2004, ACM, 2004, pp. 49–56. URL: https://doi.org/10.1145/1008992.1009004.
[19] M. Hagen, M. Völske, S. Göring, B. Stein, Axiomatic Result Re-Ranking, in: S. Mukhopadhyay, C. Zhai, E. Bertino, F. Crestani, J. Mostafa, J. Tang, L. Si, X. Zhou, Y. Chang, Y. Li, P. Sondhi (Eds.), Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, ACM, 2016, pp. 721–730. URL: https://doi.org/10.1145/2983323.2983704.
[20] A. Bondarenko, M. Hagen, M. Völske, B. Stein, A. Panchenko, C. Biemann, Webis at TREC 2018: Common Core Track, in: E. M. Voorhees, A. Ellis (Eds.), Proceedings of the Twenty-Seventh Text REtrieval Conference, TREC 2018, Gaithersburg, Maryland, USA, November 14-16, 2018, volume 500-331 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2018. URL: https://trec.nist.gov/pubs/trec27/papers/Webis-CC.pdf.
[21] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information, Trans. Assoc. Comput. Linguistics 5 (2017) 135–146. URL: https://transacl.org/ojs/index.php/tacl/article/view/999.
[22] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-Art Natural Language Processing, in: Q. Liu, D. Schlangen (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, Online, November 16-20, 2020, Association for Computational Linguistics, 2020, pp. 38–45. URL: https://doi.org/10.18653/v1/2020.emnlp-demos.6.
[23] M. F. Porter, An Algorithm for Suffix Stripping, Program 14 (1980) 130–137. URL: https://doi.org/10.1108/eb046814.
[24] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python, O’Reilly, 2009. URL: http://www.oreilly.de/catalog/9780596516499/index.html.
[25] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186. URL: https://doi.org/10.18653/v1/n19-1423.
[26] R. Bar-Haim, I. Bhattacharya, F. Dinuzzo, A. Saha, N. Slonim, Stance Classification of Context-Dependent Claims, in: M. Lapata, P. Blunsom, A.
Koller (Eds.), Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, Association for Computational Linguistics, 2017, pp. 251–261. URL: https://doi.org/10.18653/v1/e17-1024.
[27] A. Bondarenko, M. Fröbe, V. Kasturia, M. Hagen, M. Völske, B. Stein, Webis at TREC 2019: Decision Track, in: E. M. Voorhees, A. Ellis (Eds.), Proceedings of the Twenty-Eighth Text REtrieval Conference, TREC 2019, Gaithersburg, Maryland, USA, November 13-15, 2019, volume 1250 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2019. URL: https://trec.nist.gov/pubs/trec28/papers/Webis.D.pdf.
[28] A. Bondarenko, M. Fröbe, J. H. Reimer, B. Stein, M. Völske, M. Hagen, Axiomatic Retrieval Experimentation with ir_axioms, in: 45th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2022), ACM, 2022. doi:10.1145/3477495.3531743.
[29] A. Rana, P. Golchha, R. Juntunen, A. Coajă, A. Elzamarany, C.-C. Hung, S. P. Ponzetto, LeviRANK: Limited Query Expansion with Voting Integration for Document Retrieval and Ranking, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022.
[30] M. Aba, M. Azra, M. Gallo, O. Mohammad, I. Piacere, G. Virginio, N. Ferro, SEUPD@CLEF: Team Kueri on Argument Retrieval for Comparative Questions, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings, 2022.