<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Overview of Touché 2022: Argument Retrieval</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Alexander</forename><surname>Bondarenko</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Martin-Luther-Universität Halle-Wittenberg</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Maik</forename><surname>Fröbe</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Martin-Luther-Universität Halle-Wittenberg</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Johannes</forename><surname>Kiesel</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Bauhaus-Universität Weimar</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Shahbaz</forename><surname>Syed</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">Leipzig University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Timon</forename><surname>Gurcke</surname></persName>
							<affiliation key="aff3">
								<orgName type="institution">Paderborn University</orgName>
								<address>
									<settlement>Paderborn</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Meriem</forename><surname>Beloucif</surname></persName>
							<affiliation key="aff4">
								<orgName type="institution">Uppsala University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alexander</forename><surname>Panchenko</surname></persName>
							<affiliation key="aff5">
								<orgName type="institution">Skolkovo Institute of Science and Technology</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chris</forename><surname>Biemann</surname></persName>
							<affiliation key="aff6">
								<orgName type="institution">Universität Hamburg</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Benno</forename><surname>Stein</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Bauhaus-Universität Weimar</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Henning</forename><surname>Wachsmuth</surname></persName>
							<affiliation key="aff3">
								<orgName type="institution">Paderborn University</orgName>
								<address>
									<settlement>Paderborn</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Martin</forename><surname>Potthast</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">Leipzig University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Matthias</forename><surname>Hagen</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Martin-Luther-Universität Halle-Wittenberg</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Overview of Touché 2022: Argument Retrieval</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">6263738CDF733D87F3F52FE41AFA597C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T03:30+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Argument retrieval</term>
					<term>Controversial questions</term>
					<term>Comparative questions</term>
					<term>Image retrieval</term>
					<term>Shared task</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper is a report on the third year of the Touché lab on argument retrieval hosted at CLEF 2022. With the goal of supporting and promoting the research and development of new technologies for argument mining and argument analysis, we have organized three shared tasks: (a) argument retrieval for controversial topics, where the task is to find sentences that reflect the gist of arguments from online debates, (b) argument retrieval for comparative issues, where the task is to find argumentative passages from web documents that help in making a comparative decision, and (c) image retrieval for arguments, where the task is to find images that show support for or opposition to a particular stance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Decision-making and opinion-forming are everyday tasks, often involving weighing pro and con arguments for or against different options. Considering the many arguments on almost any topic on the web, in principle anyone can come to an informed decision or opinion with the help of a search engine. However, large parts of the easily accessible arguments on the web are of low quality. They may contain incoherent logic, fail to substantiate a claim, or use inappropriate language. These arguments should not appear at the top of search results, regardless of whether a query is about socially important issues or "only" personal choices. Challenges arising from this observation range from evaluating the relevance of an argument to a query and assessing how well an implied stance is justified, to identifying the gist of an argument, to finding images that illustrate a particular stance. Commercial web search engines do not sufficiently address these challenges, a gap we aim to fill with the Touché labs.</p><p>Following the two successful Touché labs on argumentation at CLEF 2020 and 2021 <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref>, our third lab edition again brought together researchers from the fields of information retrieval and natural language processing who study argumentation. At Touché 2022, we have organized the following three shared tasks, the last of which is a completely new addition: 1. Argumentative sentence retrieval from a focused collection (crawled from debate portals)</p><p>to support conversations about controversial topics. 2. Argument retrieval from a large collection of text passages to support answering comparative questions in personal decision making. 3. 
Argumentative image retrieval to support the illustration of arguments and getting an overview of the public opinion on controversial topics.</p><p>Touché follows the traditional TREC methodology: documents and topics are provided to participants, who then submit their results (up to five runs) for each topic to be assessed by human assessors. While the first two Touché editions focused on full argument and document retrieval, the third edition focused on more fine-grained retrieval units. The three shared tasks investigated whether argument retrieval can more directly support decision making and opinion formation by extraction of the gist of documents, classification of their stance on an issue as pro or con, and retrieval of images that support or oppose a particular stance.</p><p>The teams that participated in the third Touché lab were able to use the topics and assessments (relevance and quality of arguments) from the previous lab editions to train and optimize their approaches. In addition to traditional retrieval models such as BM25 <ref type="bibr" target="#b3">[4]</ref>, re-ranking approaches such as the recent transformer-based models T5 <ref type="bibr" target="#b4">[5]</ref> and T0 <ref type="bibr" target="#b5">[6]</ref> have been applied with the goal of combining topical relevance with "argumentativeness," argument quality, or stance. They are an essential part of the most effective approaches of all three Touché tasks, confirming the general trend in information retrieval and natural language processing that pre-trained transformers achieve good effectiveness <ref type="bibr" target="#b6">[7]</ref>. 
The most effective approach submitted to Task 1 re-ranks the DirichletLM model's search results by first using a BERT-based classifier <ref type="bibr" target="#b7">[8]</ref> to decide on the argumentativeness of retrieved sentence pairs (i.e., whether they are premises or assertions), then estimating their coherence using the cosine similarity of their BERT embeddings. For Task 2, in terms of relevance, a TCT-ColBERT ranker <ref type="bibr" target="#b8">[9]</ref> and, in terms of quality, a combination of query-dependent BM25F scores <ref type="bibr" target="#b9">[10]</ref> and predicted argument quality were most effective. The most effective approach for Task 3 (across topic relevance, argumentativeness, and stance relevance) used BERT instead of a stance detection model to detect the sentiment of texts from web pages and texts in images and indexed both with BM25F.</p><p>Altogether, the most effective argument retrieval approaches used various strategies for query reformulation and expansion, and for re-ranking based on estimates of argument quality or "argumentativeness". Sentiment or emotion recognition was particularly useful for the argumentative image retrieval task, as well as OCR to retrieve image text for analysis.</p><p>The corpora, topics, and judgments created at Touché are freely available to the research community and can be found on the lab's website. <ref type="foot" target="#foot_0">1</ref> Parts of the data are also already available via the BEIR <ref type="bibr" target="#b10">[11]</ref> and ir_datasets <ref type="bibr" target="#b11">[12]</ref> resources.</p></div>
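The coherence estimate described above boils down to a cosine similarity between two sentence embeddings. A minimal sketch (not the team's actual code; the toy 3-dimensional vectors stand in for the roughly 768-dimensional BERT embeddings used in practice):

```python
import math

# Sketch of the coherence signal: cosine similarity between the
# embeddings of a claim and a premise. Toy vectors are used here in
# place of real BERT sentence embeddings.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

claim_vec = [0.9, 0.1, 0.3]    # hypothetical embedding of the claim
premise_vec = [0.8, 0.2, 0.4]  # hypothetical embedding of the premise
print(round(cosine(claim_vec, premise_vec), 3))  # → 0.984
```

Pairs whose similarity falls below some threshold would then be demoted in the re-ranking.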
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Queries in argument retrieval often are phrases that describe a controversial topic, questions that ask to compare two options, or even complete claims or short arguments <ref type="bibr" target="#b12">[13]</ref>. In the third edition of the Touché lab, we address the first two query types in three different shared tasks on argument retrieval in general, on comparative scenarios, and on image retrieval. Here, we briefly summarize the related work for all three tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Argument Retrieval</head><p>The goal of argument retrieval is to find arguments that help when making a decision, when forming an opinion, or when trying to convince (or persuade) someone of a specific point of view. An argument is usually modeled as a conclusion with one or more supporting or attacking premises <ref type="bibr" target="#b13">[14]</ref>. While a conclusion is a statement that can be accepted or rejected, a premise is a more grounded statement (e.g., statistical evidence or a referenced quote).</p><p>Adding argument retrieval components to a search engine poses challenges like identifying argumentative queries <ref type="bibr" target="#b14">[15]</ref>, mining arguments from documents, or assessing an argument's relevance and quality <ref type="bibr" target="#b13">[14]</ref>. Different paradigms have been proposed for actual argument retrieval that perform argument mining and ranking in different order <ref type="bibr" target="#b15">[16]</ref>. For instance, Wachsmuth et al. <ref type="bibr" target="#b13">[14]</ref> use distant supervision to extract and index arguments from debate portals in a "pre-processing" step. Their argument search engine args.me<ref type="foot" target="#foot_1">2</ref> uses BM25F <ref type="bibr" target="#b9">[10]</ref> to then only rank the extracted arguments at query time, giving more weight to conclusions than premises. Levy et al. <ref type="bibr" target="#b16">[17]</ref> also use distant supervision to mine arguments from Wikipedia in an offline pre-processing step before ranking. Following a different paradigm, Stab et al. <ref type="bibr" target="#b17">[18]</ref> retrieve documents from the Common Crawl<ref type="foot" target="#foot_2">3</ref> at query time (no prior offline argument mining) and use a topic-dependent neural network to then extract arguments from the retrieved documents. In our Touché tasks, we address both paradigms, the one of Wachsmuth et al. 
<ref type="bibr" target="#b13">[14]</ref> in Task 1 (retrieval from a focused collection of pre-processed arguments) and the one of Stab et al. <ref type="bibr" target="#b17">[18]</ref> in Task 2 (retrieval from some general collection with online argument mining).</p><p>Argument retrieval should take into account not only topical relevance but also argument quality. What makes a good argument has been studied since the time of Aristotle <ref type="bibr" target="#b18">[19]</ref>. Wachsmuth et al. <ref type="bibr" target="#b19">[20]</ref> categorize the different aspects of argument quality into a taxonomy that covers three dimensions: logic, rhetoric, and dialectic. Logic concerns the strength of the internal structure of an argument (i.e., the conclusion and the premises along with their relations), while rhetoric covers the effectiveness of an argument in persuading an audience with its conclusion. Lastly, dialectic addresses the relations of an argument to other arguments on the topic. For example, an argument attacked by many others may be rather vulnerable in a debate. Note that an argument's relevance to a query is also categorized under dialectical quality <ref type="bibr" target="#b19">[20]</ref>.</p><p>Argument relevance has typically been assessed by an argument's similarity to a given topic and by incorporating the support and attack relations to other arguments. Potthast et al. <ref type="bibr" target="#b20">[21]</ref> evaluate four standard retrieval models for ranking arguments with regard to topical relevance, logic, rhetoric, and dialectic. One of the main findings is that DirichletLM is better at ranking arguments than BM25, DPH, and TF-IDF. Gienapp et al. 
<ref type="bibr" target="#b21">[22]</ref> later proposed a pairwise annotation strategy that reduces the costs of crowdsourcing argument retrieval annotations by 93% (i.e., requiring the annotation of only a rather small subset of argument pairs).</p><p>As for argument ranking, several approaches exploit argument relations. For instance, Wachsmuth et al. <ref type="bibr" target="#b22">[23]</ref> connect two arguments in a graph when one uses the other's conclusion as a premise and then compute an argument's PageRank <ref type="bibr" target="#b23">[24]</ref> in this graph. In their study, taking PageRank into account improves upon baselines that only use an argument's content and internal structure (conclusion and premises) <ref type="bibr" target="#b22">[23]</ref>. Later, Dumani et al. <ref type="bibr" target="#b24">[25]</ref> used support and attack relations between clusters of premises and claims as well as between clusters of claims and a query. In an extended version, Dumani and Schenkel <ref type="bibr" target="#b25">[26]</ref> also include the quality of a premise as a probability (fraction of premises that are worse with regard to cogency, reasonableness, and effectiveness). Using a pairwise quality estimator trained on the Dagstuhl-15512 ArgQuality Corpus <ref type="bibr" target="#b26">[27]</ref>, the approach with the argument quality component was more effective on the 50 topics of Task 1 from Touché 2020 than the one without taking argument quality into account.</p></div>
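The graph-based ranking of Wachsmuth et al. [23] can be illustrated with a plain power-iteration PageRank over a small, hypothetical argument graph (a sketch for illustration, not their implementation). An edge u → v here means argument u reuses argument v's conclusion as a premise, so v accumulates authority:

```python
# Sketch: PageRank by power iteration over a hypothetical argument graph.
# Edge u -> v: argument u uses argument v's conclusion as one of its premises.

def pagerank(edges, nodes, damping=0.85, iterations=50):
    """`edges` maps each node to the list of nodes it links to."""
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for u in nodes:
            targets = edges.get(u, [])
            if not targets:  # dangling node: spread its mass uniformly
                for n in nodes:
                    new_rank[n] += damping * rank[u] / len(nodes)
            else:
                for v in targets:
                    new_rank[v] += damping * rank[u] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: arguments a2 and a3 both build on a1's conclusion.
nodes = ["a1", "a2", "a3"]
edges = {"a2": ["a1"], "a3": ["a1"]}
ranks = pagerank(edges, nodes)
print(max(ranks, key=ranks.get))  # → a1 (the most reused argument)
```

The most "reused" argument ends up with the highest score, mirroring the finding that link-based authority complements content-only ranking.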
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Retrieval for Comparisons</head><p>Comparative information needs in web search have first been addressed with basic interfaces for comparing two products entered separately in two search boxes <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b28">29]</ref>. Using opinion mining approaches, comparative sentences can then be identified from product reviews in favor of or against one or the other product <ref type="bibr" target="#b29">[30,</ref><ref type="bibr" target="#b30">31,</ref><ref type="bibr" target="#b31">32]</ref>. Recently, identifying a comparison preference in a sentence (i.e., the "winning" option) has also been tackled more broadly (not just for product reviews) <ref type="bibr" target="#b32">[33,</ref><ref type="bibr" target="#b33">34]</ref> and forms the basis of the comparative argumentation machine CAM <ref type="bibr" target="#b34">[35]</ref>. Similar to the early comparison interfaces, CAM takes two objects and some comparison aspect(s) as input, retrieves comparative sentences in favor of one or the other option using BM25, and then classifies the sentences' preferences for a final merged table-like result presentation. A proper argument ranking, however, was not included in CAM. Chekalina et al. <ref type="bibr" target="#b35">[36]</ref> later extended the system to accept complete comparative questions as input and to return a natural language answer. From a comparative question, the comparison objects, aspect(s), and predicates are extracted and the system's answer is either generated directly based on transformers <ref type="bibr" target="#b7">[8]</ref> or by retrieval from an index of comparative sentences. To identify comparative questions and information needs, Bondarenko et al. <ref type="bibr" target="#b36">[37,</ref><ref type="bibr" target="#b37">38]</ref> propose a cascading ensemble of classifiers (rule-based, feature-based, and neural models). 
They also propose improved approaches to extract the comparison objects, aspects, and predicates from comparative questions and to detect the stance of potential answers towards the comparison objects. The respective stance dataset could also be used by the participants of our Task 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Image Retrieval</head><p>Images can provide contextual information and express, underline, or popularize an opinion <ref type="bibr" target="#b38">[39]</ref>, thereby taking the form of subjective statements <ref type="bibr" target="#b39">[40]</ref>. While some images can be complete arguments (i.e., expressing both a premise and a conclusion) <ref type="bibr" target="#b40">[41,</ref><ref type="bibr" target="#b41">42]</ref>, others provide contextual information only and have to be combined with a textual conclusion to form an argument. A recent SemEval task distinguished a total of 22 persuasion techniques in memes alone <ref type="bibr" target="#b42">[43]</ref>. Moreover, argument quality dimensions like acceptability, credibility, emotional appeal, and sufficiency <ref type="bibr" target="#b26">[27]</ref> all also apply to arguments that include images.</p><p>Pre-dated only by approaches relying on metadata and similarity measures <ref type="bibr" target="#b43">[44]</ref>, the actual content of images or videos has been analyzed and used for keyword-based image search for decades <ref type="bibr" target="#b44">[45]</ref>. In a recent survey, Latif et al. <ref type="bibr" target="#b45">[46]</ref> categorize image features into color, texture, shape, and spatial features, but commercial search engines also index text found in images, surrounding text, alternative texts displayed when an image is unavailable, and the image URLs <ref type="bibr" target="#b46">[47,</ref><ref type="bibr" target="#b47">48]</ref>. As for the retrieval of argumentative images, a closely related concept is that of "emotional images", which is based on image features like color and composition <ref type="bibr" target="#b48">[49,</ref><ref type="bibr" target="#b49">50]</ref>. 
Since argumentation often goes hand in hand with emotions, emotional features may also be promising for retrieving images for arguments, a relatively new task recently proposed by Kiesel et al. <ref type="bibr" target="#b50">[51]</ref> and now forming Task 3 of the Touché 2022 lab.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Lab Overview and Statistics</head><p>For the third edition of the Touché lab, we received 58 registrations, doubling the number from the previous year (29 registrations in 2021). Among the teams, 27 registered for more than one task, 17 registered particularly for Task 1, 10 for Task 2, and 4 for Task 3 (the new task this year). The majority of registrations came from Germany and Italy (13 each), followed by India (12), the United States (3), the Netherlands, France, Switzerland, and Bangladesh (2 each), and Pakistan, Portugal, the United Kingdom, Indonesia, China, the Russian Federation, Bulgaria, Nigeria, and Lebanon (1 each). Aligned with the lab's fencing-related title, the registered teams selected a real or fictional fencer or swordsman character (e.g., D'Artagnan) as their team name.</p><p>From the 58 registered teams, 23 actively participated in the tasks and submitted results<ref type="foot" target="#foot_3">4</ref> (27 teams submitted in 2021 and 17 teams in 2020). Using the setup of the previous Touché editions, we encouraged the teams to deploy their software in TIRA <ref type="bibr" target="#b51">[52]</ref> for better reproducibility of the developed approaches. The TIRA integrated research architecture is a cloud-based evaluation-as-a-service platform where shared task participants can deploy their software in a dedicated virtual machine to which they have full administrative access. By default, the virtual machines run the server version of Ubuntu 20.04 with one Intel Xeon E5-2620 CPU, 4 GB RAM, 16 GB HDD, and the latest versions of often-used software packages pre-installed (e.g., Docker and Python). If needed, we tried to customize the resources as per a team's requirements. Providing GPUs was not possible, though.</p><p>For teams that did not deploy their software in TIRA, we allowed run submissions similar to many TREC tracks. 
Teams that preferred software submissions created their runs via the web UI of TIRA by remotely executing their software inside their virtual machine. The software is fully installed in the virtual machine, and at execution time the virtual machine is shut down, disconnected from the internet, powered on again in a sandbox mode, and the test datasets for the respective tasks are mounted. Interrupting the internet connection ensures that the participants' software works without external web services that may disappear or become incompatible, which could reduce reproducibility (i.e., downloading additional external code or models during the execution is not possible). We offered support in case of problems during deployment and then archived the virtual machines that the participants used for their submissions. The respective systems can thus be re-evaluated or also applied to new datasets with the same input format.</p><p>Overall, 9 of the 23 teams submitted traditional runs instead of deploying their software in TIRA. Per team, we allowed 5 runs, each of which needed to follow the standard TREC format.<ref type="foot" target="#foot_4">5</ref> We checked the validity of each submitted run and, in case of problems, asked participants to rerun their software or resubmit their files, again offering our support. In total, 84 runs were submitted, at least one from each team.</p></div>
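The standard TREC run format has six whitespace-separated columns per line: topic ID, the literal string Q0, document ID, rank, score, and a run tag. A minimal validity check along these lines (a simplification for illustration, not the organizers' actual validation code; the document ID below is a made-up args.me-style example):

```python
# Sketch of a TREC-format run check. Each line must have six
# whitespace-separated fields: topic-id, "Q0", document-id, an integer
# rank, a numeric score, and a run tag.

def validate_run_line(line):
    parts = line.split()
    if len(parts) != 6:
        return False
    qid, q0, docid, rank, score, tag = parts
    if q0 != "Q0":
        return False
    try:
        int(rank)     # rank must be an integer
        float(score)  # score must be numeric
    except ValueError:
        return False
    return True

print(validate_run_line("31 Q0 S21dc5a06-A8a87130d 1 17.89 myGroupMyMethod"))  # → True
print(validate_run_line("31 0 doc1 1 17.89 tag"))  # → False (second field must be Q0)
```

A real validator would additionally check that topic IDs are known, that ranks per topic are consecutive, and that no document is retrieved twice for the same topic.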
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Task 1: Argument Retrieval for Controversial Questions</head><p>The goal of the Touché 2022 lab's first task was to support individuals who search for opinions and arguments on socially important controversial topics like "Are social networking sites good for our society?". Such scenarios benefit from obtaining the gists of various web resources that briefly summarize different stances (pro or con) on controversial topics. The task we considered in this regard followed the idea of extractive argument summarization <ref type="bibr" target="#b52">[53]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Task Definition and Data</head><p>Task. Given a controversial topic and a collection of arguments, the task was to retrieve sentence pairs that represent the gist of their corresponding arguments (e.g., the main claim and a supporting premise). The sentences in such a pair must not contradict each other and should ideally build upon each other in a logical manner, forming a coherent text.</p><p>Topics. We used 50 controversial topics from the previous iterations of Touché. Each topic is formulated as a question that a user might pose as a query to a search engine, accompanied by a description summarizing the information need and the search scenario, along with a narrative to guide assessors in recognizing relevant results (see Table <ref type="table" target="#tab_0">1</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Document collection.</head><p>The document collection for Task 1 was based on the args.me corpus <ref type="bibr" target="#b15">[16]</ref>, which contains about 400,000 structured arguments (crawled from the online debate portals debatewise.org, idebate.org, debatepedia.org, and debate.org). It is freely available for download<ref type="foot" target="#foot_5">6</ref> and can also be accessed through the args.me API.<ref type="foot" target="#foot_6">7</ref> To account for this year's changes in the task definition (the focus on gists), we prepared a pre-processed version of the corpus. Preprocessing steps included sentence splitting and removing premises and conclusions shorter than two words, resulting in 5,690,642 unique sentences with 64,633 claims and 5,626,509 premises.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Description</head><p>Democracy may be in the process of being disrupted by social media, with the potential creation of individual filter bubbles. So a user wonders if social networking sites should be allowed, regulated, or even banned.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Narrative</head><p>Highly relevant arguments discuss social networking in general or particular networking sites, and its/their positive or negative effects on society. Relevant arguments discuss how social networking affects people, without explicit reference to society.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Evaluation Setup</head><p>Participants submitted their rankings as traditional TREC-style runs where document IDs are sorted by descending relevance score for each search topic (i.e., the most relevant argument occurs at Rank 1). Given the large number of runs and the possibility of retrieving up to 1000 documents (in our case, sentence pairs) per topic in a run, we created the pools with TrecTools <ref type="bibr" target="#b53">[54]</ref> using a top-5 pooling strategy, resulting in 6,930 unique sentence pairs for manual assessment of relevance, quality (argumentativeness), and textual coherence. Relevance was judged by our volunteer assessors on a three-point scale: 0 (not relevant), 1 (relevant), and 2 (highly relevant). For quality, annotators assessed whether a retrieved pair of sentences is rhetorically well-written on a three-point scale: 0 (low quality/non-argumentative), 1 (average quality), and 2 (high quality). Textual coherence (whether the two sentences in a pair logically build upon each other) was also judged on a three-point scale: 0 (unrelated/contradicting), 1 (average coherence), and 2 (high coherence).</p></div>
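The graded judgments described above feed into nDCG@5, the evaluation measure used for all three dimensions. A sketch using the common log2-discount DCG formulation (tools like TrecTools compute this over full runs; the judgment list here is hypothetical):

```python
import math

# Sketch of nDCG@5 on graded judgments (0/1/2), using the standard
# DCG with a log2(rank + 1) discount.

def dcg(gains, k):
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(ranked_gains, k=5):
    ideal_dcg = dcg(sorted(ranked_gains, reverse=True), k)
    return dcg(ranked_gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical judgments for the top 5 results of one topic:
print(round(ndcg([2, 0, 1, 2, 0], k=5), 3))  # → 0.894
```

A run that places the highly relevant pairs at the top of the ranking scores 1.0; misplacing them lowers the score in proportion to the rank discount.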
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Submitted Approaches and Evaluation Results</head><p>This year's approaches included standard retrieval models such as TF-IDF, BM25, DirichletLM, and DPH. Participants also used third-party toolkits, such as the Project Debater API <ref type="bibr" target="#b54">[55]</ref> (for stance and evidence detection in arguments), Apache OpenNLP<ref type="foot" target="#foot_7">8</ref> (for language detection), and BERT-based classifiers proposed by Reimers et al. <ref type="bibr" target="#b55">[56]</ref> trained on the Webis Argument Quality Corpus <ref type="bibr" target="#b21">[22]</ref> and the IBM Rank 30K dataset <ref type="bibr" target="#b56">[57]</ref> for argument quality detection. Additionally, the semantic similarity of word and sentence embeddings based on doc2vec <ref type="bibr" target="#b57">[58]</ref>, spaCy embeddings <ref type="bibr" target="#b58">[59]</ref>, and SBERT <ref type="bibr" target="#b59">[60]</ref> has been employed for retrieving coherent sentence pairs as required by the task definition. One team leveraged the text generation capabilities of GPT-2 <ref type="bibr" target="#b60">[61]</ref> to find subsequent sentences, while another team similarly used the next sentence prediction (NSP) of BERT <ref type="bibr" target="#b7">[8]</ref> for this purpose. These toolkits augmented the document preprocessing and re-ranking of the retrieved results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Results of Task 1 (Argument Retrieval for Controversial Questions). Shown are the scores of each team's best run for the three dimensions relevance, quality, and coherence of the retrieved sentence pairs, along with the run's rank (results of all submitted runs in Tables <ref type="table">6-8</ref>). The teams are ordered alphabetically; baseline Swordsman emphasized. A † indicates statistically significant differences to the baseline (paired Student's 𝑡-test, 𝑝 = 0.05, Bonferroni-correction). We used nDCG@5 to evaluate relevance, quality, and coherence. Table <ref type="table">2</ref> shows the results of the best run per team. On all evaluated dimensions, at least eight out of ten teams managed to beat the provided baseline. Similar to previous years' results, quality is best covered by the approaches, followed by relevance and the newly added coherence dimension.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Team</head><p>Summarizing the results, for relevance, Team Porthos <ref type="bibr" target="#b61">[62]</ref> achieved the highest rank, followed by Daario Naharis <ref type="bibr" target="#b62">[63]</ref>, with nDCG@5 scores of 0.742 and 0.683, respectively. For the quality and coherence dimensions, Daario Naharis obtained the highest scores (0.913 and 0.458), followed by Porthos (0.873 and 0.429). We believe that the two-stage re-ranking employed by Daario Naharis improved coherence and quality in comparison to the other approaches. They first ensured that retrieved pairs were relevant to their context in the argument alongside the topic, which preserved high-quality arguments. Then, a second re-ranking based on stance to determine the final pairing of the retrieved sentences boosted coherence. Below, we briefly describe our baseline and summarize the submitted approaches.</p><p>Our baseline Swordsman employed a graph-based approach that ranks arguments' sentences by their centrality in the corresponding argument graph, as proposed by Alshomary et al. <ref type="bibr" target="#b52">[53]</ref>. The top two sentences per argument are used as their gist. We retrieved 1000 pairs per topic.</p><p>Bruce Banner <ref type="bibr" target="#b63">[64]</ref> employed the BM25 retrieval model implemented in the Pyserini toolkit <ref type="bibr" target="#b64">[65]</ref> with its default parameters (𝑘 1 = 1.2 and 𝑏 = 0.68). For each argument, they indexed all possible sentence pairs. To speed up computation on such a large collection of sentence pairs, they specifically opted for the sparse representations in Pyserini that produce smaller indexes compared to the dense retrieval variants. Two query variants were used: the original query (topic title) and an expanded query (narrative and description appended). Likewise, two variants of the sentence pairs were indexed: the original pair and the pair with the topic of a debate appended. 
They retrieved 1000 documents per query and did not apply any re-ranking.</p><p>D'Artagnan <ref type="bibr" target="#b65">[66]</ref> also employed sparse retrieval together with text preprocessing and query expansion. For retrieval, they used two retrieval models from Lucene: BM25 <ref type="bibr" target="#b9">[10]</ref> (𝑘 1 = 1.2 and 𝑏 = 0.75) and DirichletLM (𝜇 = 2000). For preprocessing, they experimented with both the Porter <ref type="bibr" target="#b66">[67]</ref> and Krovetz <ref type="bibr" target="#b67">[68]</ref> stemmers. Additionally, they filtered both character and word n-grams (referred to as shingles) and used two stop word lists (SMART System <ref type="bibr" target="#b68">[69]</ref>, Glasgow IR <ref type="foot" target="#foot_8">9</ref>). Query expansion was done using synonyms from WordNet <ref type="bibr" target="#b69">[70]</ref> and word2vec <ref type="bibr" target="#b70">[71]</ref>. Evaluation on the previous year's relevance judgments showed that the combination of the DirichletLM retrieval model, the Krovetz stemmer, and the Glasgow IR stop word list improved performance compared to their respective counterparts.</p></div>
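Several of the approaches above score sentence pairs with BM25, either via Pyserini or Lucene, with the parameters noted in the text. As a self-contained sketch of the BM25 scoring formula (an illustration, not the teams' actual Lucene/Pyserini implementations; the sample documents are invented):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, docs, k1=1.2, b=0.68):
    """BM25 score of one tokenized document against a query.

    docs is the full tokenized collection, needed for document frequencies
    and the average document length used in length normalization.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in docs if t in d)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc_terms) / avgdl))
        score += idf * norm
    return score
```

The parameter 𝑘1 controls term-frequency saturation and 𝑏 the strength of document-length normalization, which is why several teams tuned them away from the library defaults.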
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D'Artagnan</head><p>Daario Naharis <ref type="bibr" target="#b62">[63]</ref> developed a standard Lucene-based document retrieval system using the TF-IDF model. Additionally, they introduced a new measure called ICoefficient for scoring the discriminative power of a term. This complements the standard TF-IDF weighting by additionally considering the number of documents that contain at least one occurrence of a given term. We refer readers to Bahrami et al. <ref type="bibr" target="#b62">[63]</ref> for the mathematical formulation of the ICoefficient. For preprocessing, they created two custom stop lists, each composed of the 100 most frequent terms in the indexed collections of the argument contexts and individual arguments from the provided corpus. Document re-ranking was performed based on stance and evidence detection using the Project Debater API <ref type="bibr" target="#b54">[55]</ref>.</p><p>Gamora <ref type="bibr" target="#b71">[72]</ref> developed Lucene-based approaches using deduplication and contextual feature-enriched indexing, adding the topic of a debate and the stance on the topic, to obtain document-level relevance and quality scores, following the approaches used in previous Touché editions <ref type="bibr" target="#b2">[3]</ref>. To find relevant sentence pairs rather than relevant documents, these results were used to limit the number of documents, either by creating a new index for only the sentences of relevant documents (double indexing) or by creating all possible sentence combinations and ranking them based on a weighted average of the argument quality (estimated using an SVM classifier) of the pair and its source document. BM25 <ref type="bibr" target="#b64">[65]</ref> (𝑘 1 = 1.2 and 𝑏 = 0.75) and DirichletLM (𝜇 = 2000) were used for document similarity, and SBERT <ref type="bibr" target="#b59">[60]</ref> and TF-IDF for sentence similarity. 
The best approach is based on double indexing combined with a manual query reduction in which only the 2-6 main words of the query were kept, query boosting, query decorators, query expansion with important keywords (GloVe <ref type="bibr" target="#b72">[73]</ref>) and synonyms (WordNet <ref type="bibr" target="#b69">[70]</ref>), as well as possessive removal, stemming (Krovetz stemmer <ref type="bibr" target="#b67">[68]</ref>), and length filtering of the sentences.</p><p>General Grevious <ref type="bibr" target="#b73">[74]</ref> used a conventional IR pipeline based on Lucene. First, documents were lowercased and tokenized, and possessives (trailing 's) were removed, keeping only tokens with a length between 3 and 20 characters. In addition, the team experimented with a variety of stemming approaches (S-stemmer <ref type="bibr" target="#b74">[75]</ref>, Krovetz stemmer <ref type="bibr" target="#b67">[68]</ref>, Porter stemmer <ref type="bibr" target="#b66">[67]</ref>, no stemming) and stop word lists (Core NLP <ref type="bibr" target="#b75">[76]</ref>, CountWordsFree <ref type="bibr" target="#b64">[65]</ref>, EBSCO, <ref type="foot" target="#foot_9">10</ref> GoogleStop, <ref type="foot" target="#foot_10">11</ref> and Ranks <ref type="foot" target="#foot_11">12</ref>). To retrieve documents, BM25 <ref type="bibr" target="#b64">[65]</ref> (𝑘 1 = 1.2 and 𝑏 = 0.75) and DirichletLM (𝜇 ∈ {1700, 1800}) were used together with query boosting, by assigning weights to the used inputs (argument, conclusion, debate title, and argument title), and query expansion, by finding keywords (Rapid Automatic Keyword Extraction (RAKE) <ref type="bibr" target="#b76">[77]</ref>) and synonyms (Datamuse <ref type="foot" target="#foot_12">13</ref>). This retrieval step was done once for the documents and once for all the potential sentence pairs within these retrieved documents to obtain a ranking of sentence pairs. 
Finally, sentiment analysis (Vader <ref type="bibr" target="#b77">[78]</ref>) was used to boost documents with a sentiment similar to that of the query, and readability analysis (Flesch-Kincaid <ref type="bibr" target="#b78">[79]</ref>) was used for re-ranking. Their best model does not include re-ranking, stemming, or stop word removal, but relies solely on the combination of query expansion and the BM25 retrieval model.</p><p>Gorgon <ref type="bibr" target="#b79">[80]</ref> also used a Lucene-based IR pipeline and compared the BM25 <ref type="bibr" target="#b64">[65]</ref> (𝑘 1 = 1.2 and 𝑏 = 0.75) and DirichletLM (𝜇 = 2000) similarity measures, developing four different analyzers with different preprocessing steps, including lowercasing, stemming (Krovetz stemmer <ref type="bibr" target="#b67">[68]</ref>), removing possessives (trailing 's), and filtering stop words (99webtools, <ref type="foot" target="#foot_13">14</ref> EBSCO). Sentence pairs were created from all combinations within a single document before indexing. The best approach is a combination of lowercasing, removing possessives, and BM25.</p><p>Hit Girl <ref type="bibr" target="#b80">[81]</ref> proposed a two-stage retrieval pipeline that combines semantic search and re-ranking via argument quality agnostic models. Documents were embedded into vectors using spaCy <ref type="bibr" target="#b58">[59]</ref>. These were then indexed via Elasticsearch, and its text similarity function was used for semantic search. They experimented with three approaches for re-ranking: maximal marginal relevance <ref type="bibr" target="#b81">[82]</ref>, word mover's distance <ref type="bibr" target="#b82">[83]</ref>, and a novel method called structural distance, which employs fuzzy matching between query and sentences based on POS tags. Preliminary evaluations showed that, while re-ranking improved the argument quality to varying degrees, it also affected relevance. 
Among the three re-ranking methods, structural distance performed best.</p><p>Korg's <ref type="bibr" target="#b83">[84]</ref> approaches are based on the Elasticsearch implementation of DirichletLM (𝜇 = 2000) to find the best matching argumentative sentences for a query after employing lowercasing, ASCII folding, stop word filtering (with a manually created stop word list), and stemming (Krovetz stemmer <ref type="bibr" target="#b67">[68]</ref>). Then, either doc2vec <ref type="bibr" target="#b57">[58]</ref> or SBERT <ref type="bibr" target="#b59">[60]</ref> is trained on all sentences in the args.me corpus and used to find the most similar sentence pair within a document by directly comparing the embeddings. Alternatively, instead of directly comparing sentences, GPT-2 <ref type="bibr" target="#b60">[61]</ref> was used to generate a follow-up sentence for a given sentence and to then find the corpus sentence most similar to the generated one. The best approach is based on lowercasing, ASCII folding, stop word filtering, stemming, and doc2vec's similarity calculation without GPT-2.</p><p>Pearl <ref type="bibr" target="#b84">[85]</ref> also proposed a two-stage retrieval pipeline, using the DirichletLM <ref type="bibr" target="#b85">[86]</ref> and DPH [87] models to retrieve argumentative sentences. For both stages, they used the PyTerrier toolkit <ref type="bibr" target="#b87">[88]</ref>. After retrieving the documents, two BERT-based argument quality models, fine-tuned on the Webis Argument Quality Corpus <ref type="bibr" target="#b88">[89]</ref> and the IBM-Rank-30k dataset <ref type="bibr" target="#b56">[57]</ref>, were employed to filter out non-argumentative results. The resulting prototype from the first stage was considered the baseline model. On evaluating it on a set of 35 queries taken from the provided topics, they found that the DPH model assigned high relevance to sentences even if the matching terms are part of a URL or other metadata in the corpus. 
Moreover, it was also susceptible to homonyms, which negatively affected the retrieval performance. To account for this, a refined prototype was developed that combined argument quality prediction with query expansion. For query expansion, they applied the Bo1 query expansion algorithm provided by PyTerrier, which weighs the terms based on divergence from randomness built on Bose-Einstein statistics <ref type="bibr" target="#b89">[90]</ref>. Specifically, the Bo1 model extracts terms from the top-ranked documents retrieved for the original query, weighs them based on their informativeness, and appends the highest-weighted terms to the original query to expand it. Finally, a custom block list consisting of commonly repeated phrases such as "my opponent claims... ", "PRO claims... ", and "I accept this debate" filtered out further noisy sentences, leading to improved nDCG scores.</p><p>Porthos <ref type="bibr" target="#b61">[62]</ref> used the Elasticsearch implementation of DirichletLM (with 𝜇 = 116 being the average length of sentences in the corpus) and BM25 <ref type="bibr" target="#b64">[65]</ref> (default Elasticsearch implementation with 𝑘 1 = 1.2 and 𝑏 = 0.75) for retrieval after removing sentence duplicates and filtering out non-relevant sentences with low-quality language, retaining only those that contain at least one verb. Another filtering step is based on the argumentativeness of sentences, using the support vector machine (SVM) of <ref type="bibr" target="#b21">[22]</ref> and the BERT approach of <ref type="bibr" target="#b55">[56]</ref>. In addition, sentences were stemmed and lowercased, and stop words were removed. The approaches compose the search term from single terms and Boolean queries, and use the approach of Reimers et al. <ref type="bibr" target="#b55">[56]</ref> to reorder the retrieved sentences according to their argumentative quality. 
The sentences are paired using SBERT <ref type="bibr" target="#b59">[60]</ref> and BERT <ref type="bibr" target="#b7">[8]</ref> trained on next sentence prediction (NSP). The best approach is based on DirichletLM, NSP, using the sentence classifier in preprocessing, Boolean queries with noun chunking for retrieval, and the BERT approach of <ref type="bibr" target="#b55">[56]</ref> for re-ranking.</p></div>
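Besides BM25, the retrieval model that recurs across the submissions above is DirichletLM. A minimal query-likelihood scorer with Dirichlet smoothing, shown here as an illustration only (the teams used the Lucene or Elasticsearch implementations; the sample collection is invented):

```python
import math
from collections import Counter

def dirichlet_score(query_terms, doc_terms, collection_terms, mu=2000):
    """Query log-likelihood of a tokenized document under Dirichlet smoothing.

    Each term's probability is smoothed toward the collection language model:
    p(t|d) = (tf(t,d) + mu * p(t|C)) / (|d| + mu).
    """
    tf = Counter(doc_terms)
    ctf = Counter(collection_terms)
    total = len(collection_terms)
    score = 0.0
    for t in query_terms:
        p_c = ctf[t] / total  # collection language model
        if p_c == 0:
            continue          # term unseen in the collection: skip it
        score += math.log((tf[t] + mu * p_c) / (len(doc_terms) + mu))
    return score
```

The smoothing parameter 𝜇 acts like a document-length-dependent prior, which explains the different settings in the text (𝜇 = 2000 for long texts vs. 𝜇 = 116, the average sentence length, in Porthos's sentence-level setup).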
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Task 2: Argument Retrieval for Comparative Questions</head><p>The goal of the Touché 2022 lab's second task was to support informed decisions in "everyday" or personal comparison situations-for instance, for a question like "Should I major in philosophy or psychology?". Decision making in such situations benefits from finding balanced reasons for choosing one option over the other, usually in the form of opinions or arguments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Task Definition and Data</head><p>Task. Given a collection of text passages and a comparative topic with two comparison objects, the task was to retrieve relevant argumentative passages for or against one or both objects, and to detect the passages' stances with respect to the objects.</p><p>Topics. We provided 50 topics that describe scenarios of personal decision making. Each topic has a title formulated as a comparative question, a pair of comparison objects from the title that could be used for the stance detection of the retrieved passages, a description with some background on the particular search scenario, and a narrative that served as a guideline for our assessors (cf. Table <ref type="table" target="#tab_2">3</ref> for an example).</p><p>Document collection. The retrieval collection for Task 2 was a corpus of 868,655 passages extracted from ClueWeb12. <ref type="foot" target="#foot_14">15</ref> We constructed this passage corpus using all 37,248 documents from the top-100 pool of all runs submitted to Task 2 in the previous Touché editions. Using the TREC CAsT tools, <ref type="foot" target="#foot_15">16</ref> we split the documents at sentence boundaries into fixed-length passages of approximately 250 terms, since ranking fixed-length passages was shown to be more effective than ranking variable-length passages <ref type="bibr" target="#b90">[91]</ref>. From the initial 1,286,977 passages, we removed near-duplicates with CopyCat <ref type="bibr" target="#b91">[92]</ref> to mitigate unwanted side-effects of near-duplicates on retrieval effectiveness <ref type="bibr" target="#b92">[93,</ref><ref type="bibr" target="#b93">94]</ref>, resulting in the final collection of 868,655 passages. We also provided a second version of the corpus, in which the passages were expanded with queries generated by the docT5query model <ref type="bibr" target="#b94">[95]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Description</head><p>(Example topic, cf. Table <ref type="table" target="#tab_2">3</ref>.) A soon-to-be high-school graduate finds themselves at a crossroads in their life. Based on their interests, majoring in philosophy or in psychology are the potential options, and the graduate is searching for information about the differences and similarities, as well as the advantages and disadvantages, of majoring in either of them (e.g., with respect to career opportunities or gained skills).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Narrative</head><p>Relevant documents will overview one of the two majors in terms of career prospects or newly developed skills, or they will provide a list of reasons to major in one or the other. Highly relevant documents will compare the two majors side-by-side and help to decide which should be preferred in what context. Not relevant are study program and university advertisements or general descriptions of the disciplines that do not mention benefits, advantages, or pros/cons.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>To lower the bar to entry of this task, we also provided the participants with a number of previously compiled resources. These included the document-level relevance and argument quality judgments from the previous Touché editions as well as the passage-level relevance judgments from a subset of MS MARCO <ref type="bibr" target="#b95">[96]</ref> with about 40,000 comparative questions identified by an ALBERT-based <ref type="bibr" target="#b96">[97]</ref> classifier <ref type="bibr" target="#b37">[38]</ref>. Each question in MS MARCO is associated with 10 text passages (one is labeled as most relevant). 
To train stance detectors, an annotated dataset of 950 comparative questions and answers, extracted from Stack Exchange, was also provided <ref type="bibr" target="#b37">[38]</ref>. For the identification of claims and premises, the participants could use any argument tagging tool of their own or an existing one, such as the API <ref type="foot" target="#foot_16">17</ref> of TARGER <ref type="bibr" target="#b97">[98]</ref> hosted on our own servers.</p></div>
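The corpus construction above packs whole sentences into passages of roughly 250 terms. A simplified greedy chunker in that spirit (assuming sentences are already split; the actual TREC CAsT tooling may differ in details such as overlap handling):

```python
def chunk_passages(sentences, max_terms=250):
    """Greedily pack whole sentences into passages of at most ~max_terms terms.

    A sentence is never split; a single sentence longer than max_terms
    simply becomes its own passage.
    """
    passages, current, length = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and length + n > max_terms:
            passages.append(" ".join(current))
            current, length = [], 0
        current.append(sent)
        length += n
    if current:
        passages.append(" ".join(current))
    return passages
```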
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Evaluation Setup</head><p>Similar to Task 1, we pooled the top-5 passages from the runs, resulting in 2,107 unique passages that were manually judged. Our volunteer human assessors labeled the passages' relevance with three labels: 0 (not relevant), 1 (relevant), and 2 (highly relevant). They also assessed whether arguments are present in a passage and whether they are rhetorically well-written <ref type="bibr" target="#b26">[27]</ref> with three labels: 0 (low quality, or no arguments in a passage), 1 (average quality), and 2 (high quality). Finally, we asked the assessors to label passages with respect to a topic's comparison objects as (a) pro first object, (b) pro second object, (c) neutral (both comparison objects are equally good or bad), and (d) no stance (no stance given). In Task 2, we used nDCG@5 for the relevance and argument quality dimensions and macro-averaged F1 for the stance detection.</p></div>
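Macro-averaged F1 weighs the four stance classes equally regardless of how often they occur. A minimal reference implementation of the measure (illustrative; equivalent to scikit-learn's `f1_score(..., average="macro")`):

```python
from collections import defaultdict

def macro_f1(gold, predicted):
    """Macro-averaged F1: per-class F1 over all observed labels, averaged
    with equal class weight."""
    labels = set(gold) | set(predicted)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    f1s = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```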
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Submitted Approaches and Evaluation Results</head><p>Seven teams submitted their results to Task 2 (25 valid runs). Interestingly, only two teams used relevance judgments from the previous Touché editions to fine-tune their models or to optimize parameters. The others either manually labeled a sample of retrieved documents themselves or relied on zero-shot approaches like the transformer-based model T0++ <ref type="bibr" target="#b5">[6]</ref>. Most teams used the standard passage collection, but two teams also used the docT5query-expanded <ref type="bibr" target="#b94">[95]</ref> collection that we provided. Overall, this year's main trend was the use of transformer-based models for ranking and re-ranking (e.g., ColBERT <ref type="bibr" target="#b98">[99]</ref> or monoT5 and duoT5 <ref type="bibr" target="#b99">[100]</ref>), while our baseline approach was BM25, as in the previous years.</p><p>For the optional subtask of stance detection, five of the seven teams submitted results. They either trained their own classifiers on the provided stance dataset, fine-tuned pre-trained language models, or directly used pre-trained models as zero-shot classifiers. Our baseline stance detector was a simple always-'no stance' predictor (majority class).</p><p>Table <ref type="table">4</ref> shows the results of each team's most effective runs with respect to relevance and argument quality (more detailed results for each submitted run can be found in Appendix A). For stance detection, for each team, we evaluated all passages that were part of the manual judgment pool and for which the team had predicted a stance (i.e., the stance of a passage returned at rank 3 by some Team X (and thus part of the judgment pool) was also used in the stance evaluation of Team Y, even when the document was only at rank 6 or lower (and thus not actually part of the pool for that run)). 
Note that this potentially yields different numbers of passages used for the stance evaluation per team. Below, we briefly describe the teams' submitted approaches and their results (teams ordered by their relevance-wise best approach).</p><p>Captain Levi <ref type="bibr" target="#b100">[101]</ref> submitted the relevance-wise most effective run. They first retrieved 2,000 documents using Pyserini's BM25 <ref type="bibr" target="#b64">[65]</ref> (𝑘 1 = 1.2 and 𝑏 = 0.68) by combining the top-1000 results for the original query (topic title) with the results for modified queries, where they used alternative strategies: (1) only removing stop words (using the NLTK <ref type="bibr" target="#b101">[102]</ref> stop word list), (2) replacing comparative adjectives with synonyms and antonyms found in WordNet <ref type="bibr" target="#b69">[70]</ref>, (3) adding extra terms using pseudo-relevance feedback, (4) using queries generated with the docT5query model <ref type="bibr" target="#b94">[95]</ref> provided by the Touché organizers. Queries and corpus were also processed by stop word and punctuation removal and lemmatization (WordNet lemmatizer). The initially retrieved results were re-ranked using monoT5 and duoT5 <ref type="bibr" target="#b99">[100]</ref>. Additionally, TCT-ColBERT <ref type="bibr" target="#b8">[9]</ref> (a variant of ColBERT <ref type="bibr" target="#b98">[99]</ref> with knowledge distillation) was also used for the initial ranking for unmodified queries (topic titles). Captain Levi submitted in total five runs that differ in the aforementioned strategies of modifying queries, initial ranking models, and final re-ranking models. Their most effective run in terms of relevance and quality used the initial ranking by TCT-ColBERT. Finally, stance was detected using a RoBERTa-Large-MNLI model, pre-trained on the Multi-Genre Natural Language Inference corpus <ref type="bibr" target="#b103">[104]</ref> without further fine-tuning, in two steps: (1) detecting if the document has a stance, and then (2) for documents that were not classified as 'neutral' or 'no stance', detecting which comparison object the document favors. This stance detector achieved the highest macro-averaged F1 score.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Results of Task 2 (Argument Retrieval for Comparative Questions). (a) Evaluation results of a team's best run according to the results' relevance. (b) Best runs according to the results' quality. (c) Stance detection results (the teams' ordering is the same as in (b)). An asterisk (⋆) indicates that the runs with the best relevance and the best quality differ for a team. The baseline BM25 ranking is shown in bold; the baseline stance detector always predicts 'no stance'. A † indicates statistically significant differences to the baseline (paired Student's 𝑡-test, 𝑝 = 0.05, Bonferroni correction). Since stance detection results were calculated for different numbers of predictions for each team, we do not test statistical differences. Tables <ref type="table" target="#tab_0">9-11</ref> show the results for all submitted runs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Aldo Nadi <ref type="bibr" target="#b104">[105]</ref> submitted the quality-wise most effective run. They re-ranked passages that were initially retrieved with BM25F <ref type="bibr" target="#b9">[10]</ref> (default Lucene implementation with 𝑘 1 = 1.2 and 𝑏 = 0.75) on two fields: the text of the original passages, and the passages expanded with docT5query. 
All texts were processed with the Porter stemmer <ref type="bibr" target="#b66">[67]</ref>, removing stop words using different lists: (a) Snowball <ref type="bibr" target="#b105">[106]</ref>, (b) a default Lucene stop word list, (c) a custom list containing the 400 most frequent terms in the retrieval collection, excluding the comparison objects. Queries (topic titles) were expanded using a relevance feedback method based on the Rocchio algorithm <ref type="bibr" target="#b106">[107]</ref>. For the final ranking, the team experimented with two re-ranking techniques (involving up to the top-1000 documents from the initial results): (1) exploiting the argument quality estimation, i.e., they multiplied the document relevance and the quality scores, and (2) Reciprocal Rank Fusion <ref type="bibr" target="#b107">[108]</ref>. The quality scores were predicted using the IBM Project Debater API <ref type="bibr" target="#b54">[55]</ref>. Aldo Nadi submitted five runs, which vary by different combinations of the proposed methods, e.g., using different stop word lists for pre-processing, using relevance feedback or not, and using the quality-based re-ranking or fusion. The team's most effective run in terms of relevance used relevance feedback, and the most effective run in terms of quality was based on Reciprocal Rank Fusion. The team did not detect stance.</p><p>Katana <ref type="bibr" target="#b108">[109]</ref> submitted three runs that all used different variants of ColBERT <ref type="bibr" target="#b98">[99]</ref>: (1) pre-trained on MS MARCO <ref type="bibr" target="#b95">[96]</ref> by the University of Glasgow, <ref type="foot" target="#foot_17">18</ref> (2) pre-trained by Katana from scratch on MS MARCO, replacing the cosine similarity between query and document representations with the L2 distance, and (3) the latter model fine-tuned on the relevance and quality judgments from the previous Touché editions. 
As queries, the team used the topic titles without additional processing. The team's most effective run in terms of relevance used the ranking of the pre-trained ColBERT, and the most effective run in terms of quality used the ranking of the ColBERT model trained from scratch (without further fine-tuning). For stance detection, Katana used a pre-trained XGBoost-based classifier that is part of the Comparative Argumentation Machine <ref type="bibr" target="#b34">[35,</ref><ref type="bibr" target="#b32">33]</ref>.</p><p>Captain Tempesta <ref type="bibr" target="#b109">[110]</ref> exploited linguistic properties of text such as the frequency of non-informative symbols (hashtags, emojis, etc.), the difference between the frequencies of short words (at most 4 characters) and long words (more than 4 characters), and the frequencies of adjectives and comparative adjectives. Based on these properties, for each document in the retrieval corpus a quality score was computed as a weighted sum (weights were assigned manually). At query time, the relevance score of BM25 (Lucene; default 𝑘 1 = 1.2 and 𝑏 = 0.75) was multiplied by the quality score, and the product was used as the ranking criterion. Queries (topic titles) were processed by removing stop words (Lucene default list), lowercasing query terms except for brand names, <ref type="foot" target="#foot_18">19</ref> and stemming them using the Lovins stemmer <ref type="bibr" target="#b110">[111]</ref>. The team's five submitted runs differ in the weights manually assigned to the different quality properties. The team's most effective run in terms of relevance used the document quality estimation with linguistic properties, and the most effective run in terms of quality did not. The team did not detect stance.</p><p>Olivier Armstrong <ref type="bibr" target="#b111">[112]</ref> submitted one run. They first identified the comparison objects, aspects, and predicates in queries (topic titles) using a RoBERTa-based classifier proposed by Bondarenko et al. 
<ref type="bibr" target="#b37">[38]</ref>. After removing stop words, queries were expanded with synonyms of the objects, aspects, and predicates found using WordNet. Then, 100 documents were retrieved using Elasticsearch's BM25 (𝑘 1 = 1.2 and 𝑏 = 0.75) as the initial ranking. Using a DistilBERT-based classifier <ref type="bibr" target="#b112">[113]</ref>, fine-tuned by Alhamzeh et al. <ref type="bibr" target="#b113">[114]</ref> (a Touché 2021 participant), Olivier Armstrong identified premises and claims in the retrieved documents. For ranking, the following scores were calculated for each candidate document: (1) the arg-BM25 score returned by querying the newly re-indexed corpus (only premises and claims are kept) using the unmodified queries (topic titles), (2) the argument support score, i.e., the ratio of premises and claims in the document, and (3) the similarity score, i.e., the averaged cosine similarity between the original query and every premise and claim in the document, represented using the SBERT embeddings <ref type="bibr" target="#b59">[60]</ref>. The final score for each candidate document was calculated as the sum of the normalized individual scores. Their final ranking included 25 documents. For stance detection, the team used an LSTM-based neural network with one hidden layer that was pre-trained on the provided stance dataset.</p><p>Puss in Boots was our baseline retrieval model that used the BM25 implementation in Pyserini <ref type="bibr" target="#b64">[65]</ref> with default parameters (𝑘 1 = 0.9 and 𝑏 = 0.4) and the original topic titles as queries. The baseline stance detector simply assigned 'no stance' to all documents in the ranked list.</p><p>Grimjack <ref type="bibr" target="#b114">[115]</ref> submitted five runs using query expansion and query reformulation, argument quality estimation, stance detection, and axiomatic re-ranking. 
For the first ranking result, the team simply retrieved 100 passages ranked with a Pyserini implementation of DirichletLM (default 𝜇 = 1000), using the original, unmodified queries (topic titles). Another approach re-ranked the top-10 of the initially retrieved passages using (1) argument axioms that "prefer" documents with more premises and claims (identified with TARGER <ref type="bibr" target="#b97">[98]</ref>) or an earlier occurrence of query terms in premises and claims <ref type="bibr" target="#b115">[116,</ref><ref type="bibr" target="#b116">117]</ref>, (2) newly proposed comparative axioms that "prefer" documents with more comparison objects or their earlier occurrence in premises and claims, and (3) an argument quality axiom that ranks documents with higher argument quality scores (calculated using the IBM Project Debater API <ref type="bibr" target="#b54">[55]</ref>) higher. For another result ranking, document positions (from the previous run) were changed based on the predicted stance, e.g., a 'pro first object' document was followed by a 'pro second object' document, followed by documents with 'neutral' stance. The document stance was predicted using the IBM Project Debater API <ref type="bibr" target="#b54">[55]</ref>. The last two runs used T0++ <ref type="bibr" target="#b5">[6]</ref> (1) to expand queries, e.g., by combining topic titles with newly generated queries, where T0++ was prompted to generate a question given a topic's description, (2) to assess the argument quality, and (3) to detect the stance in zero-shot settings. These two runs differed in whether stance balancing was used. 
The team's most effective run in terms of relevance and quality used axiomatic re-ranking and re-ranking based on the detected stance.</p><p>Asuna <ref type="bibr" target="#b117">[118]</ref> preprocessed each document (passage) in the retrieval corpus by (1) creating a one-sentence extractive summary using LexRank <ref type="bibr" target="#b118">[119]</ref>, (2) identifying premises and claims with TARGER <ref type="bibr" target="#b97">[98]</ref>, and (3) looking up the spam score in the Waterloo Spam Rankings dataset <ref type="bibr" target="#b119">[120]</ref>. <ref type="foot" target="#foot_19">20</ref> The modified corpus was indexed, and an initial retrieval of the top-40 documents was performed with the Pyserini <ref type="bibr" target="#b64">[65]</ref> implementation of BM25F (default 𝑘 1 = 0.9 and 𝑏 = 0.4) using the unmodified queries (topic titles) over the index fields with original passages, summaries, and premises and claims. Next, the queries were lemmatized, stop words were removed using the NLTK library, and the queries were expanded with the most frequent terms from LDA topics <ref type="bibr" target="#b120">[121]</ref> for the initially retrieved documents. The expanded queries were used to again retrieve the top-40 passages with BM25F. Finally, Asuna re-ranked the retrieved documents using a random forest classifier <ref type="bibr" target="#b121">[122]</ref> with the following features: the BM25F score, the number of times the document was retrieved for different queries (the original query, three queries extended with the LDA topics for documents, and one extended with the LDA topic for the task topic description), the number of tokens in the document, the number of sentences, the number of premises, the number of claims, the spam score, the predicted argument quality score, and the predicted stance. The classifier was trained on the Touché 2020 and 2021 relevance judgments. 
The argument quality was predicted using DistilBERT, fine-tuned on the Webis-ArgQuality-20 corpus <ref type="bibr" target="#b88">[89]</ref>. The stance was also predicted using DistilBERT, fine-tuned on the provided stance dataset.</p></div>
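Among the fusion techniques used above, Aldo Nadi's quality-wise best run relied on reciprocal rank fusion. The standard formula, score(d) = Σᵢ 1/(k + rankᵢ(d)) with the customary constant k = 60 (the team's exact configuration is not stated here), can be sketched as:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document accumulates 1/(k + rank)
    from every list it appears in; documents are returned by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks (not raw scores) enter the formula, RRF needs no score normalization across the fused systems, which is why it is a popular late-fusion baseline.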
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Task 3: Image Retrieval for Arguments</head><p>The goal of the Touché 2022 lab's third task was to provide argumentation support through image search. The retrieval of relevant images should provide both a quick visual overview of frequent arguments on some topic and compelling images to support one's argumentation. The goal of the third task was thus to retrieve images that indicate agreement or disagreement with some stance on a given topic, returned as two separate lists, similar to textual argument search.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Task Definition and Data</head><p>Task. Given a controversial topic, the task was to retrieve images (from web pages) for each stance (pro and con) that show support for that stance.</p><p>Topics. Task 3 uses the same 50 controversial topics as Task 1 (cf. Section 4).</p><p>Document collection. This task's document collection stems from a focused crawl of 23,841 images and associated web pages from late 2021. For each of the 50 topics, we issued 11 queries (with different filter words like "good," "meme," "stats," "reasons," or "effects") to Google's image search and downloaded the top 100 images and associated web pages; 868 duplicate images were identified and removed using pHash <ref type="foot" target="#foot_20">21</ref> and manual checks. For each image, the dataset contains: (1) the image itself in both WebP and PNG format; (2) its URL; and (3) its pHash. Moreover, for each page, the dataset contains: (1) its URL; (2) the Google rank of the page for each query for which the image was retrieved; (3) a WARC web archive; <ref type="foot" target="#foot_21">22</ref> (4) a DOM HTML snapshot; (5) its complete text; (6) a screenshot; (7) meta-information for each DOM node, including the node's XPath, CSS attributes, and position on the screenshot; and (8) the XPath of the corresponding image in the DOM HTML snapshot. The full dataset is 368 GB in size. <ref type="foot" target="#foot_22">23</ref> To kickstart machine learning approaches, we provided 334 relevance judgments from Kiesel et al. <ref type="bibr" target="#b50">[51]</ref>.</p></div>
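Perceptual hashing reduces each image to a compact fingerprint so that near-duplicates can be found via small Hamming distances. The deduplication above used pHash, which is DCT-based; as a simpler illustration of the same idea, here is a difference-hash sketch, assuming images have already been resized to a small grayscale grid (pHash itself works differently):

```python
def dhash_bits(gray, hash_size=8):
    """Difference hash of a hash_size x (hash_size + 1) grayscale grid:
    each bit records whether a pixel is brighter than its right neighbor."""
    bits = 0
    for row in range(hash_size):
        for col in range(hash_size):
            bits = (bits << 1) | (gray[row][col] > gray[row][col + 1])
    return bits

def hamming(a, b):
    """Number of differing bits; near-duplicate images yield small distances."""
    return bin(a ^ b).count("1")
```

Two images would be flagged as duplicate candidates when the Hamming distance between their hashes falls below a chosen threshold, followed by manual checks as described above.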
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Evaluation Setup</head><p>We employed crowdsourcing on Amazon Mechanical Turk <ref type="foot" target="#foot_23">24</ref> to evaluate the topical relevance, argumentativeness, and stance of the 6,607 image-topic pairs from all runs, with five independent annotators each. Specifically, we asked for each retrieved image-topic pair:</p><p>(1) Is the image in some manner related to the topic? (2) Do you think most people would say that, if someone shares this image without further comment, they want to show they approve of the pro side of the topic? (3) Or do you think most people would rather say the one who shares this image does so to show they disapprove? We described each topic using the topic's title, modified as necessary to convey the description and narrative (cf. Table <ref type="table" target="#tab_0">1</ref>) and to clarify which stance counts as approval (pro) and which as disapproval (con). We then iteratively employed MACE <ref type="bibr" target="#b122">[123]</ref> to identify image-topic pairs with low annotator agreement (MACE confidence ≤ 0.55) and re-judged them ourselves, employing our judgments as check instances for another iteration of MACE. We repeated this procedure until MACE predicted the labels for all image-topic pairs from the runs with a confidence above 0.55 (re-judging 2,056 images total).</p></div>
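MACE estimates labels and confidences with an EM model of annotator competence. Purely to illustrate the filtering criterion of the re-judging loop above, a plain majority-vote agreement ratio can stand in for the MACE confidence; this proxy is an assumption for illustration, not MACE's actual computation:

```python
from collections import Counter

def label_with_confidence(votes):
    """Majority label and its agreement ratio, a crude stand-in for
    MACE's posterior confidence (MACE itself runs EM over annotator
    competence; this simple proxy is for illustration only)."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)

def needs_rejudging(votes, threshold=0.55):
    """Flag an image-topic pair for manual re-judging when the
    confidence does not exceed the threshold (0.55 above)."""
    return label_with_confidence(votes)[1] <= threshold
```

In the actual setup, flagged pairs were re-judged by the organizers and fed back as check instances for another MACE iteration until all confidences exceeded 0.55.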
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">Submitted Approaches and Evaluation Results</head><p>In total, 3 teams submitted 12 runs to this task. The teams pursued quite different approaches. However, all participants employed OCR (specifically Tesseract<ref type="foot" target="#foot_24">25</ref> ) to extract image text.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 5</head><p>Results of Task 3 (Image Retrieval for Arguments) in terms of Precision@10 (per stance) for topic relevance, argumentativeness, and stance relevance. The table shows the best run for each team across all three measures. Results for the baseline are shown in bold. A † indicates statistically significant differences from the baseline (paired Student's 𝑡-test, 𝑝 = 0.05, Bonferroni correction). Table <ref type="table" target="#tab_0">12</ref> shows the results for all submitted runs.</p><p>The teams Boromir and Jester also used the associated web page's text, but Jester restricted it to text close to the image on the web page. Each team used sentiment or emotion features, based on image colors (Aramis), faces in the images (Jester), the image text (all), and the web page text (Boromir, Jester). Boromir used the ranking information for internal evaluation. We used Precision@10 for evaluation: the ratio of relevant images among the top 10 retrieved images for each topic and stance. Table <ref type="table">5</ref> shows the results of each team's most effective run. For each team, the same run was best with respect to all three measures.</p><p>Minsc represents our baseline run, which ranks images in the same order as our original Google queries: the query including the filter word "good" for pro and the query including "anti" for con. We considered this a tough baseline, especially for on-topic relevance, as topical relatedness is similar for argumentative and "standard" web image search. However, Boromir beat this baseline, with a considerable margin for on-topic relevance.</p><p>Aramis <ref type="bibr" target="#b123">[124]</ref> focused on image features. No retrieval model was employed; instead, all images were evaluated for each topic. They tested a heuristic formula versus fully connected neural network classifiers for both argumentativeness and stance detection. 
The features were based on OCR (text length in characters, text area size, number of cells in an 8×8 grid with high text density, and the VADER sentiment score <ref type="bibr" target="#b77">[78]</ref>), image color (average color, dominant color, and the percentage of pixels in each of several self-defined RGB buckets: red, green, blue, yellow, light, and dark), image category (graphic vs. photo <ref type="bibr" target="#b124">[125]</ref>; percentage of area covered by diagrams 26 ), and query-text similarity (whether the query is fully contained, the overlap for an optimal query alignment, and the VADER sentiment score of words within a six-token radius around occurrences of query terms in the text). However, the query-text similarity features were not used for the argumentativeness classification, as the team assumed this sub-task to be query-independent. In our evaluation, the formula performed better than the neural approaches, which Aramis traced back to the formula being slightly better at handling off-topic images: since topical relevance was not the team's focus, they had trained and internally evaluated the networks on on-topic images only. Nevertheless, their worst runs still achieved a Precision@10 similar to their best one, namely 0.664 (topic; -0.037 compared to the best run), 0.609 (argumentativeness; -0.025), and 0.344 (stance; -0.037). 
Moreover, for an evaluation that ignores the problem of topical relevance, the ratio of argumentative images among the topically relevant images of their runs lies between 0.904 (using both formulas) and 0.927 (using both networks), and is thus very close to the baseline's ratio of 0.932.</p><p>Boromir <ref type="bibr" target="#b125">[126]</ref> indexed both the image text (boosted five-fold) and the web page text (Elasticsearch BM25 with default settings, 𝑘 1 = 1.2 and 𝑏 = 0.75), using lowercasing, removal of URLs, punctuation, and numbers, NLTK's WordNet lemmatization <ref type="bibr" target="#b101">[102]</ref>, removal of single-letter tokens, stop word removal (using the list from NLTK), and minimum-frequency filtering (removing tokens that appear fewer than three times in the text). They clustered the images into 13 clusters (a number determined by the elbow criterion) using 𝑘-means and manually assigned retrieval boosts per cluster to favor more argumentative images, especially diagrams. For example, the clusters with the highest boost of 5.0 were found, upon manual inspection, to contain "graphics with text (e.g., memes, quotes, twitter posts)," "graphics with round forms and text (e.g., pie charts)," "statistical graphics but with better quality [...] (e.g., bar plots, tables, line plots)," and "statistical plots (bar plots and line plots)." Images from clusters found to contain mostly photos (five clusters), on the other hand, were not boosted. For stance detection, they employed textual sentiment detection, using either a dictionary (AFINN <ref type="bibr" target="#b126">[127]</ref>) or a BERT classifier. Their approach performed best and convincingly improved over the baseline. The BERT classifier improved over the dictionary-based classifier, whereas the image clustering was detrimental. 
Specifically, the image clustering seemed to introduce more off-topic images into the ranking: the same setup as the best run but with image clusters achieved a Precision@10 of 0.822 (topic; -0.056), 0.728 (argumentativeness; -0.040), and 0.411 (stance; -0.014).</p><p>Jester focused on emotion-based image retrieval via facial expression recognition (using FER <ref type="foot" target="#foot_25">27</ref> ), the image text, and the part of the associated web page's text that is close to the image in the HTML source code; specifically, they used the text within the image's parent element. Similar to stance detection in the args.me search engine <ref type="bibr" target="#b13">[14]</ref>, they assigned positive-leaning images to the pro stance and negative-leaning images to the con stance. For comparison, they submitted a second run without emotion features (thus plain retrieval), which achieved a lower Precision@10: 0.671 (topic; -0.025), 0.618 (argumentativeness; -0.029), and 0.336 (stance; -0.014). Emotion features thus seem helpful, but insufficient on their own.</p></div>
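The Precision@10 measure used throughout Task 3, the ratio of relevant images among the top 10 retrieved per topic and stance, can be sketched as:

```python
def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k retrieved items that are judged relevant.
    The denominator is always k, so retrieving fewer than k items
    is penalized."""
    top = retrieved[:k]
    return sum(1 for item in top if item in relevant) / k
```

In the evaluation above, this is computed per topic and per stance list, then averaged over topics.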
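Aramis's color features include the percentage of pixels falling into self-defined RGB buckets. A minimal sketch of such bucketing follows; the thresholds are illustrative assumptions, not the team's exact bucket definitions:

```python
from collections import Counter

def color_bucket_shares(pixels):
    """Share of pixels per coarse RGB bucket (red, green, blue, yellow,
    light, dark). Thresholds are illustrative; pixels matching no
    bucket are simply not counted."""
    buckets = Counter()
    for r, g, b in pixels:
        if r > 150 and g < 100 and b < 100:
            buckets["red"] += 1
        elif g > 150 and r < 100 and b < 100:
            buckets["green"] += 1
        elif b > 150 and r < 100 and g < 100:
            buckets["blue"] += 1
        elif r > 150 and g > 150 and b < 100:
            buckets["yellow"] += 1
        elif min(r, g, b) > 200:
            buckets["light"] += 1
        elif max(r, g, b) < 60:
            buckets["dark"] += 1
    n = len(pixels)
    return {k: v / n for k, v in buckets.items()}
```

Such shares would then feed the heuristic formula or the neural classifiers alongside the OCR and category features described above.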
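Boromir's five-fold boost of the image text over the page text corresponds to Elasticsearch-style field boosting, where per-field relevance scores are scaled and summed. A minimal sketch under that assumption (the field names, boost values, and scoring interface here are illustrative, not Boromir's code):

```python
def boosted_score(scores_by_field, boosts):
    """Combine per-field relevance scores (e.g., BM25 per field) with
    multiplicative boosts; fields without an explicit boost get 1.0."""
    return sum(boosts.get(field, 1.0) * score
               for field, score in scores_by_field.items())

# Image text boosted five-fold relative to the page text, as described above.
BOOSTS = {"image_text": 5.0}
```

The manually assigned per-cluster boosts for diagram-like images would enter the ranking in the same multiplicative way.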
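Jester's heuristic of assigning positive-leaning images to the pro stance and negative-leaning ones to the con stance amounts to thresholding a sentiment score. A sketch of that mapping, where the neutral margin is an assumed parameter rather than a documented value:

```python
def stance_from_sentiment(score, margin=0.05):
    """Map a sentiment/emotion score in [-1, 1] to a stance:
    positive-leaning -> pro, negative-leaning -> con, else neutral.
    The margin defining the neutral band is an illustrative assumption."""
    if score > margin:
        return "pro"
    if score < -margin:
        return "con"
    return "neutral"
```

In Jester's pipeline, such a score would be aggregated from facial emotions, image text, and nearby page text before the pro/con assignment.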
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>The third edition of the Touché lab at CLEF 2022 featured three shared tasks: (1) argument retrieval for controversial questions, (2) argument retrieval for comparative questions, and (3) image retrieval for arguments. Compared to previous editions, the retrieval units were changed (sentences/passages instead of full arguments/documents, and images as a completely new unit) and stance detection was included. Of 58 registered teams, 23 participated in the tasks and submitted at least one valid run. In addition to sparse retrieval and various query processing, reformulation, and expansion methods, the approaches increasingly focused on transformer models and re-ranking techniques. Not only was the quality of the documents and arguments evaluated, but the predicted stance was also taken into account for the final rankings.</p><p>The most effective approaches to argument retrieval share common characteristics. For example, most use various strategies for query reformulation and expansion, such as synonyms, relevance feedback, or generating new queries with pre-trained language models. An interesting observation is that re-ranking first-stage search results based on a quality assessment of the arguments almost always improves retrieval effectiveness. Specifically for Task 2 (comparative questions), re-ranking based on important terms such as comparison objects and aspects, or on argument units in documents (premises and claims), was successful. Also in Task 2, stance detection was a new subtask, and some participants included a re-ranking step based on the predicted stance in their retrieval pipelines, which had some promising effects on retrieval effectiveness. However, the still rather low overall effectiveness of the stance detection approaches leaves room for future improvements. 
For Task 3 (image retrieval), the recognition of sentiment and emotion and the use of OCR to analyze the text in images were particularly helpful.</p><p>We plan to continue Touché as a collaborative platform for researchers in argument retrieval. All Touché resources are freely available, including the topics, the manual relevance and argument quality assessments, and the submitted runs of the participating teams. These resources, the submission and evaluation tools, and further events such as workshops will help to foster the community working on argument retrieval. In the future, we plan to expand the evaluation pools and to include additional dimensions of argument quality. Improving stance detection and better exploiting predicted stances, not only for ranking textual arguments but also images, are further interesting directions for future work.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Example topic for Task 1: Argument Retrieval for Controversial Questions.</figDesc><table><row><cell>Number</cell><cell>34</cell></row><row><cell>Title</cell><cell>Are social networking sites good for our society?</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Example topic for Task 2: Argument Retrieval for Comparative Questions.</figDesc><table><row><cell>Number</cell><cell>88</cell></row><row><cell>Title</cell><cell>Should I major in philosophy or psychology?</cell></row><row><cell>Objects</cell><cell>major in philosophy, psychology</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://webis.de/events.html?q=Touche#shared-tasks</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://www.args.me/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://commoncrawl.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">Three teams did not submit a paper describing their approach, though.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">The expected format was also described at the lab's web page: https://webis.de/events/touche-22/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">https://webis.de/data.html#args-me-corpus</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">https://www.args.me/api-en.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">https://opennlp.apache.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8">https://github.com/igorbrigadir/stopwords/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_9">https://connect.ebsco.com/s/article/What-are-stop-words-and-how-does-EBSCO-s-search-engine-handle-them?</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_10">https://www.semrush.com/blog/seo-stop-words/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_11">https://www.ranks.nl/stopwords</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_12">https://www.datamuse.com/api/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="14" xml:id="foot_13">https://99webtools.com/blog/list-of-english-stop-words/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="15" xml:id="foot_14">https://lemurproject.org/clueweb12/index.php</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="16" xml:id="foot_15">https://github.com/grill-lab/trec-cast-tools</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="17" xml:id="foot_16">Also available as a Python library: https://pypi.org/project/targer-api/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="18" xml:id="foot_17">http://www.dcs.gla.ac.uk/∼craigm/colbert.dnn.zip</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="19" xml:id="foot_18">https://github.com/MatthiasWinkelmann/english-words-names-brands-places</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="20" xml:id="foot_19">https://lemurproject.org/clueweb12/related-data.php</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="21" xml:id="foot_20">https://www.phash.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="22" xml:id="foot_21">Archived using https://github.com/webis-de/scriptor</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="23" xml:id="foot_22">Available at https://webis.de/data.html#touche22-image-retrieval-for-arguments</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="24" xml:id="foot_23">https://www.mturk.com</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="25" xml:id="foot_24">https://github.com/tesseract-ocr/tesseract</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="27" xml:id="foot_25">https://github.com/justinshenk/fer</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We are very grateful to the CLEF 2022 organizers and the Touché participants, who allowed this lab to happen. We also want to thank our volunteer annotators who helped to create the relevance and argument quality assessments and our reviewers for their valuable feedback on the participants' notebooks.</p><p>This work was partially supported by the Deutsche Forschungsgemeinschaft (DFG) through the projects "ACQuA 2.0" (Answering Comparative Questions with Arguments; project number 376430233) and "OASiS" (Objective Argument Summarization in Search; project number 455913891) as part of the priority program "RATIO: Robust Argumentation Machines" (SPP 1999), and the German Ministry for Science and Education (BMBF) through the project "SharKI" (Shared Tasks as an Innovative Approach to Implement AI and Big Data-based Applications within Universities; grant FKZ 16DHB4021). We are also grateful to Jan Heinrich Reimer for developing the TARGER Python library and Erik Reuter for expanding a document collection for Task 2 with docT5query.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Full Evaluation Results of Touché 2022: Argument Retrieval Table 6</head><p>Relevance results of all runs submitted to Task 1: Argument Retrieval for Controversial Questions. Reported are the mean nDCG@5 and the 95% confidence intervals. The baseline Swordsman is shown in bold. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 7</head><p>Quality results of all runs submitted to Task 1: Argument Retrieval for Controversial Questions. Reported are the mean nDCG@5 and the 95% confidence intervals. The baseline Swordsman is shown in bold. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 8</head><p>Coherence results of all runs submitted to Task 1: Argument Retrieval for Controversial Questions.</p><p>Reported are the mean nDCG@5 and the 95% confidence intervals. The baseline Swordsman is shown in bold. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 9</head><p>Relevance results of all runs submitted to Task 2: Argument Retrieval for Comparative Questions.</p><p>Reported are the mean nDCG@5 and the 95% confidence intervals; Puss in Boots baseline in bold. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Team</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 10</head><p>Quality results of all runs submitted to Task 2: Argument Retrieval for Comparative Questions. Reported are the mean nDCG@5 and the 95% confidence intervals; Puss in Boots baseline in bold. Reported are a macro-averaged F 1 for each team and run and number of documents N for which the stance was predicted; Puss in Boots baseline that always predicts 'no stance' is in bold. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Team</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 12</head><p>Results of all runs submitted to Task 3 Image Retrieval. Reported are the mean precision@10 (per stance) for topic relevance, argumentativeness, and stance relevance and the 95% confidence intervals (low and high). Results for the baseline are shown in bold. </p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of Touché 2022: Argument retrieval</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bondarenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fröbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kiesel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Syed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gurcke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Beloucif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Panchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Biemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wachsmuth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association (CLEF 2022)</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Overview of Touché 2020: Argument retrieval</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bondarenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fröbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Beloucif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gienapp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ajjour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Panchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Biemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wachsmuth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-2696/" />
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2020 Evaluation Labs</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">2696</biblScope>
		</imprint>
	</monogr>
	<note>CEUR Workshop Proceedings</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Overview of Touché 2021: Argument retrieval</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bondarenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gienapp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fröbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Beloucif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ajjour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Panchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Biemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wachsmuth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-2936/" />
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2021 Evaluation Labs</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">2936</biblScope>
		</imprint>
	</monogr>
	<note>CEUR Workshop Proceedings</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Okapi at TREC-3</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Walker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hancock-Beaulieu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gatford</surname></persName>
		</author>
		<ptr target="https://trec.nist.gov/pubs/trec3/papers/city.ps.gz" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of The Third Text REtrieval Conference, TREC 1994</title>
				<meeting>The Third Text REtrieval Conference, TREC 1994</meeting>
		<imprint>
			<publisher>NIST</publisher>
			<date type="published" when="1994">1994</date>
			<biblScope unit="page" from="109" to="126" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Exploring the limits of transfer learning with a unified text-to-text transformer</title>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Matena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v21/20-074.html" />
	</analytic>
	<monogr>
		<title level="j">J. Mach. Learn. Res</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page">67</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Multitask prompted training enables zero-shot task generalization</title>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Webson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">H</forename><surname>Bach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sutawika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Alyafeai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chaffin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Stiegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">L</forename><surname>Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Raja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Bari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Thakker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Szczechla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chhablani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">V</forename><surname>Nayak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Datta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Manica</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">X</forename><surname>Yong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Pandey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bawden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Neeraj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rozen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Santilli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Févry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Fries</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Teehan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Biderman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Rush</surname></persName>
		</author>
		<idno>CoRR abs/2110.08207</idno>
		<ptr target="https://arxiv.org/abs/2110.08207" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Pretrained Transformers for Text Ranking: BERT and Beyond, Synthesis Lectures on Human Language Technologies</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nogueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yates</surname></persName>
		</author>
		<idno type="DOI">10.2200/S01123ED1V01Y202108HLT053</idno>
		<ptr target="https://doi.org/10.2200/S01123ED1V01Y202108HLT053" />
		<imprint>
			<date type="published" when="2021">2021</date>
			<publisher>Morgan &amp; Claypool Publishers</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">BERT: pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/n19-1423</idno>
		<ptr target="https://doi.org/10.18653/v1/n19-1423" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, ACL</title>
				<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, ACL</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Distilling dense representations for ranking using tightly-coupled teachers</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<idno>CoRR abs/2010.11386</idno>
		<ptr target="https://arxiv.org/abs/2010.11386" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Simple BM25 extension to multiple weighted fields</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zaragoza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Taylor</surname></persName>
		</author>
		<idno type="DOI">10.1145/1031171.1031181</idno>
		<ptr target="https://doi.org/10.1145/1031171.1031181" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 13th International Conference on Information and Knowledge Management, CIKM 2004</title>
				<meeting>the 13th International Conference on Information and Knowledge Management, CIKM 2004</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="42" to="49" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models</title>
		<author>
			<persName><forename type="first">N</forename><surname>Thakur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rücklé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=wCu6T5xFjeJ" />
	</analytic>
	<monogr>
		<title level="m">Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Simplified data wrangling with ir_datasets</title>
		<author>
			<persName><forename type="first">S</forename><surname>Macavaney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Feldman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Downey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goharian</surname></persName>
		</author>
		<idno type="DOI">10.1145/3404835.3463254</idno>
		<ptr target="https://doi.org/10.1145/3404835.3463254" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021</title>
				<meeting>the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2429" to="2436" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Retrieval of the best counterargument without prior topic knowledge</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wachsmuth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Syed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/P18-1023/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Association for Computational Linguistics</title>
				<meeting>the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="241" to="251" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Building an argument search engine for the web</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wachsmuth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Al-Khatib</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ajjour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Puschmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dorsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Morari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bevendorff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/w17-5106</idno>
		<ptr target="https://doi.org/10.18653/v1/w17-5106" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fourth Workshop on Argument Mining (ArgMining), Association for Computational Linguistics</title>
				<meeting>the Fourth Workshop on Argument Mining (ArgMining), Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="49" to="59" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Identifying argumentative questions in web search logs</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Ajjour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Braslavski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bondarenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<idno type="DOI">10.1145/3477495.3531864</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2022</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Data acquisition for argument search: The args.me corpus</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Ajjour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wachsmuth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kiesel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-30179-8_4</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-30179-8_4" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 42nd German Conference on AI, KI 2019</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<meeting>the 42nd German Conference on AI, KI 2019</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">11793</biblScope>
			<biblScope unit="page" from="48" to="59" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Towards an argumentative content search engine using weak supervision</title>
		<author>
			<persName><forename type="first">R</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bogin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gretz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Aharonov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Slonim</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/C18-1176/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Association for Computational Linguistics</title>
				<meeting>the 27th International Conference on Computational Linguistics, COLING 2018, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="2066" to="2081" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">ArgumenText: Searching for arguments in heterogeneous sources</title>
		<author>
			<persName><forename type="first">C</forename><surname>Stab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Daxenberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Stahlhut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schiller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tauchmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Eger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/N18-5005" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL 2018, Association for Computational Linguistics</title>
				<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL 2018, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="21" to="25" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">On Rhetoric: A Theory of Civic Discourse</title>
		<author>
			<persName><surname>Aristotle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Kennedy</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
			<publisher>Oxford University Press</publisher>
			<pubPlace>Oxford</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Argumentation quality assessment: Theory vs. practice</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wachsmuth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Naderi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Habernal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hirst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P17-2039</idno>
		<ptr target="https://doi.org/10.18653/v1/P17-2039" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Association for Computational Linguistics</title>
				<meeting>the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="250" to="255" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Argument search: Assessing argument relevance</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gienapp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Euchner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Heilenkötter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Weidmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wachsmuth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<idno type="DOI">10.1145/3331184.3331327</idno>
		<ptr target="https://doi.org/10.1145/3331184.3331327" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 42nd International Conference on Research and Development in Information Retrieval, SIGIR 2019</title>
				<meeting>the 42nd International Conference on Research and Development in Information Retrieval, SIGIR 2019</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1117" to="1120" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Efficient pairwise annotation of argument quality</title>
		<author>
			<persName><forename type="first">L</forename><surname>Gienapp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.511</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.511/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5772" to="5781" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">PageRank&quot; for argument relevance</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wachsmuth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ajjour</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/e17-1105</idno>
		<ptr target="https://doi.org/10.18653/v1/e17-1105" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Association for Computational Linguistics</title>
				<meeting>the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1117" to="1127" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">The PageRank Citation Ranking: Bringing Order to the Web</title>
		<author>
			<persName><forename type="first">L</forename><surname>Page</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Brin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Motwani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Winograd</surname></persName>
		</author>
		<idno>1999-66</idno>
		<ptr target="http://ilpubs.stanford.edu:8090/422/" />
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
		<respStmt>
			<orgName>Stanford InfoLab</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">A framework for argument retrieval - ranking argument clusters by frequency and specificity</title>
		<author>
			<persName><forename type="first">L</forename><surname>Dumani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Schenkel</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-45439-5_29</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-45439-5_29" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 42nd European Conference on IR Research (ECIR 2020)</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<meeting>the 42nd European Conference on IR Research (ECIR 2020)</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">12035</biblScope>
			<biblScope unit="page" from="431" to="445" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Quality aware ranking of arguments</title>
		<author>
			<persName><forename type="first">L</forename><surname>Dumani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Schenkel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 29th ACM International Conference on Information &amp; Knowledge Management, CIKM &apos;20</title>
				<meeting>the 29th ACM International Conference on Information &amp; Knowledge Management, CIKM &apos;20</meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="335" to="344" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Computational argumentation quality assessment in natural language</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wachsmuth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Naderi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bilu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Prabhakaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">A</forename><surname>Thijm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hirst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<ptr target="http://aclweb.org/anthology/E17-1017" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017</title>
				<meeting>the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="176" to="187" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">A comparative web browser (CWB) for browsing and comparing web pages</title>
		<author>
			<persName><forename type="first">A</forename><surname>Nadamoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Tanaka</surname></persName>
		</author>
		<idno type="DOI">10.1145/775152.775254</idno>
		<ptr target="https://doi.org/10.1145/775152.775254" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th International World Wide Web Conference, WWW 2003</title>
				<meeting>the 12th International World Wide Web Conference, WWW 2003</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="727" to="735" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">CWS: A comparative web search system</title>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<idno type="DOI">10.1145/1135777.1135846</idno>
		<ptr target="https://doi.org/10.1145/1135777.1135846" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th International Conference on World Wide Web, WWW 2006</title>
				<meeting>the 15th International Conference on World Wide Web, WWW 2006</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="467" to="476" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Identifying comparative sentences in text documents</title>
		<author>
			<persName><forename type="first">N</forename><surname>Jindal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.1145/1148170.1148215</idno>
		<ptr target="https://doi.org/10.1145/1148170.1148215" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 29th Annual International Conference on Research and Development in Information Retrieval, SIGIR 2006</title>
				<meeting>the 29th Annual International Conference on Research and Development in Information Retrieval, SIGIR 2006</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="244" to="251" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Mining comparative sentences and relations</title>
		<author>
			<persName><forename type="first">N</forename><surname>Jindal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<ptr target="http://www.aaai.org/Library/AAAI/2006/aaai06-209.php" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 21st National Conference on Artificial Intelligence and the 18th Innovative Applications of Artificial Intelligence Conference, AAAI 2006</title>
				<meeting>the 21st National Conference on Artificial Intelligence and the 18th Innovative Applications of Artificial Intelligence Conference, AAAI 2006</meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="1331" to="1336" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">A corpus of comparisons in product reviews</title>
		<author>
			<persName><forename type="first">W</forename><surname>Kessler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kuhn</surname></persName>
		</author>
		<ptr target="http://www.lrec-conf.org/proceedings/lrec2014/summaries/1001.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014, European Language Resources Association (ELRA)</title>
				<meeting>the 9th International Conference on Language Resources and Evaluation, LREC 2014, European Language Resources Association (ELRA)</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="2242" to="2248" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Categorizing comparative sentences</title>
		<author>
			<persName><forename type="first">A</forename><surname>Panchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bondarenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Franzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Biemann</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/w19-4516</idno>
		<ptr target="https://doi.org/10.18653/v1/w19-4516" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th Workshop on Argument Mining, ArgMining@ACL 2019, Association for Computational Linguistics</title>
				<meeting>the 6th Workshop on Argument Mining, ArgMining@ACL 2019, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="136" to="145" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Entity-aware dependency-based deep graph attention network for comparative preference classification</title>
		<author>
			<persName><forename type="first">N</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mazumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/2020.acl-main.512/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5782" to="5788" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Answering comparative questions: Better than ten-blue-links?</title>
		<author>
			<persName><forename type="first">M</forename><surname>Schildwächter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bondarenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zenker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Biemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Panchenko</surname></persName>
		</author>
		<idno type="DOI">10.1145/3295750.3298916</idno>
		<ptr target="https://doi.org/10.1145/3295750.3298916" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Human Information Interaction and Retrieval</title>
				<meeting>the 2019 Conference on Human Information Interaction and Retrieval<address><addrLine>CHIIR</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2019">2019. 2019</date>
			<biblScope unit="page" from="361" to="365" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Which is better for deep learning: Python or matlab? answering comparative questions in natural language</title>
		<author>
			<persName><forename type="first">V</forename><surname>Chekalina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bondarenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Biemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Beloucif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Logacheva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Panchenko</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/2021.eacl-demos.36/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, EACL 2021, Association for Computational Linguistics</title>
				<meeting>the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, EACL 2021, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="302" to="311" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Comparative web search questions</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bondarenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Braslavski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Völske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Aly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fröbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Panchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Biemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<idno type="DOI">10.1145/3336191.3371848</idno>
		<ptr target="https://dl.acm.org/doi/abs/10.1145/3336191.3371848" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 13th ACM International Conference on Web Search and Data Mining, WSDM 2020</title>
				<meeting>the 13th ACM International Conference on Web Search and Data Mining, WSDM 2020</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="52" to="60" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Towards understanding and answering comparative questions</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bondarenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ajjour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Dittmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Homann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Braslavski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<idno type="DOI">10.1145/3488560.3498534</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th ACM International Conference on Web Search and Data Mining, WSDM 2022</title>
				<meeting>the 15th ACM International Conference on Web Search and Data Mining, WSDM 2022</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="66" to="74" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">On images as evidence and arguments</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">J</forename><surname>Dove</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-94-007-4041-9_15</idno>
	</analytic>
	<monogr>
		<title level="m">Topical Themes in Argumentation Theory: Twenty Exploratory Studies, Argumentation Library</title>
				<editor>
			<persName><forename type="first">F</forename><forename type="middle">H</forename><surname>Van Eemeren</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Garssen</surname></persName>
		</editor>
		<meeting><address><addrLine>Dordrecht, The Netherlands</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="223" to="238" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">Images, emotions, politics</title>
		<author>
			<persName><forename type="first">F</forename><surname>Dunaway</surname></persName>
		</author>
		<idno type="DOI">10.1017/mah.2018.17</idno>
	</analytic>
	<monogr>
		<title level="j">Modern American History</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="369" to="376" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Visual argumentation: A further reappraisal</title>
		<author>
			<persName><forename type="first">G</forename><surname>Roque</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-94-007-4041-9_18</idno>
		<ptr target="https://doi.org/10.1007/978-94-007-4041-9_18" />
	</analytic>
	<monogr>
		<title level="m">Topical Themes in Argumentation Theory</title>
				<editor>
			<persName><forename type="first">F</forename><forename type="middle">H</forename><surname>Van Eemeren</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Garssen</surname></persName>
		</editor>
		<meeting><address><addrLine>Dordrecht, The Netherlands</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="page" from="273" to="288" />
		</imprint>
	</monogr>
	<note>series Title: Argumentation Library</note>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">Types of visual arguments, Argumentum</title>
		<author>
			<persName><forename type="first">I</forename><surname>Grancea</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the Seminar of Discursive Logic, Argumentation Theory and Rhetoric</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="16" to="34" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">Martino, SemEval-2021 task 6: Detection of persuasion techniques in texts and images</title>
		<author>
			<persName><forename type="first">D</forename><surname>Dimitrov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Shaar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Silvestri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Firooz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><surname>Da San</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.semeval-1.7</idno>
		<ptr target="https://aclanthology.org/2021.semeval-1.7.doi:10.18653/v1/2021.semeval-1.7" />
	</analytic>
	<monogr>
		<title level="m">15th International Workshop on Semantic Evaluation (SemEval&apos;2021), Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="70" to="98" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<analytic>
		<title level="a" type="main">Query-by-pictorial-example</title>
		<author>
			<persName><forename type="first">N</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Fu</surname></persName>
		</author>
		<idno type="DOI">10.1109/TSE.1980.230801</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Software Engineering</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="519" to="524" />
			<date type="published" when="1980">1980</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<analytic>
		<title level="a" type="main">Content-based representation and retrieval of visual media: A state-of-the-art review</title>
		<author>
			<persName><forename type="first">P</forename><surname>Aigrain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Petkovic</surname></persName>
		</author>
		<idno type="DOI">10.1007/BF00393937</idno>
	</analytic>
	<monogr>
		<title level="j">Multimedia Tools and Applications</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="179" to="202" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">Content-based image retrieval and feature extraction: A comprehensive review</title>
		<author>
			<persName><forename type="first">A</forename><surname>Latif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rasheed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Sajid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">I</forename><surname>Ratyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zafar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">H</forename><surname>Dar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sajid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Khalil</surname></persName>
		</author>
		<idno type="DOI">10.1155/2019/9658350</idno>
	</analytic>
	<monogr>
		<title level="j">Mathematical Problems in Engineering</title>
		<imprint>
			<biblScope unit="page">21</biblScope>
			<date type="published" when="2019">2019. 2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Wu</surname></persName>
		</author>
		<ptr target="https://support.google.com/webmasters/answer/114016" />
		<title level="m">Learn more about what you see on google images, Google Blog</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b47">
	<monogr>
		<ptr target="https://support.google.com/webmasters/answer/114016" />
		<title level="m">Google, Google images best practices, Google Developers</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b48">
	<analytic>
		<title level="a" type="main">A survey on emotional semantic image retrieval</title>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>He</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICIP.2008.4711705</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Image Processing (ICIP 2008</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="117" to="120" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b49">
	<analytic>
		<title level="a" type="main">Color emotions for multi-colored images</title>
		<author>
			<persName><forename type="first">M</forename><surname>Solli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lenz</surname></persName>
		</author>
		<idno type="DOI">10.1002/col.20604</idno>
	</analytic>
	<monogr>
		<title level="j">Color Research &amp; Application</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="210" to="221" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b50">
	<analytic>
		<title level="a" type="main">Image retrieval for arguments using stance-aware query expansion</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kiesel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Reichenbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 8th Workshop on Argument Mining, ArgMining 2021 at EMNLP, ACL</title>
				<meeting>the 8th Workshop on Argument Mining, ArgMining 2021 at EMNLP, ACL</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="36" to="45" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b51">
	<analytic>
		<title level="a" type="main">TIRA integrated research architecture</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gollub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wiegmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-22948-1_5</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-22948-1_5" />
	</analytic>
	<monogr>
		<title level="m">Information Retrieval Evaluation in a Changing World -Lessons Learned from 20 Years of CLEF</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="page" from="123" to="160" />
		</imprint>
	</monogr>
	<note>The Information Retrieval Series</note>
</biblStruct>

<biblStruct xml:id="b52">
	<analytic>
		<title level="a" type="main">Extractive snippet generation for arguments</title>
		<author>
			<persName><forename type="first">M</forename><surname>Alshomary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Düsterhus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wachsmuth</surname></persName>
		</author>
		<idno type="DOI">10.1145/3397271.3401186</idno>
		<ptr target="https://doi.org/10.1145/3397271.3401186" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 43nd International ACM Conference on Research and Development in Information Retrieval, SIGIR 2020</title>
				<meeting>the 43nd International ACM Conference on Research and Development in Information Retrieval, SIGIR 2020</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1969" to="1972" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b53">
	<analytic>
		<title level="a" type="main">TrecTools: an Open-source Python Library for Information Retrieval Practitioners Involved in TREC-like Campaigns</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R M</forename><surname>Palotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Scells</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zuccon</surname></persName>
		</author>
		<idno type="DOI">10.1145/3331184.3331399</idno>
		<ptr target="https://doi.org/10.1145/3331184.3331399" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 42nd International Conference on Research and Development in Information Retrieval, SIGIR 2019</title>
				<meeting>the 42nd International Conference on Research and Development in Information Retrieval, SIGIR 2019</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1325" to="1328" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b54">
	<analytic>
		<title level="a" type="main">Project debater apis: Decomposing the AI grand challenge</title>
		<author>
			<persName><forename type="first">R</forename><surname>Bar-Haim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kantor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Venezian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Katz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Slonim</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.emnlp-demo.31</idno>
		<ptr target="https://doi.org/10.18653/v1/2021.emnlp-demo.31" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2021, Online and</title>
				<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2021, Online and<address><addrLine>Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021-11-11">7-11 November, 2021. 2021</date>
			<biblScope unit="page" from="267" to="274" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b55">
	<analytic>
		<title level="a" type="main">Classification and clustering of arguments with contextualized word embeddings</title>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schiller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Beck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Daxenberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Stab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P19-1054</idno>
		<ptr target="https://aclanthology.org/P19-1054.doi:10.18653/v1/P19-1054" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<meeting>the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="567" to="578" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b56">
	<analytic>
		<title level="a" type="main">A large-scale dataset for argument quality ranking: Construction and analysis</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gretz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Friedman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cohen-Karlik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Toledo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lahav</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Aharonov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Slonim</surname></persName>
		</author>
		<ptr target="https://ojs.aaai.org/index.php/AAAI/article/view/6285" />
	</analytic>
	<monogr>
		<title level="m">The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020</title>
				<meeting><address><addrLine>EAAI</addrLine></address></meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2020">2020. 2020</date>
			<biblScope unit="page" from="7805" to="7813" />
		</imprint>
	</monogr>
	<note>The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence</note>
</biblStruct>

<biblStruct xml:id="b57">
	<analytic>
		<title level="a" type="main">Distributed representations of sentences and documents</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1188" to="1196" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b58">
	<analytic>
		<title level="a" type="main">spacy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing</title>
		<author>
			<persName><forename type="first">M</forename><surname>Honnibal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Montani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">To appear</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="411" to="420" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b59">
	<analytic>
		<title level="a" type="main">Sentence-BERT: Sentence embeddings using siamese bertnetworks</title>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D19-1410</idno>
		<ptr target="https://doi.org/10.18653/v1/D19-1410" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Association for Computational Linguistics</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3980" to="3990" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b60">
	<analytic>
		<title level="a" type="main">Language models are unsupervised multitask learners</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">OpenAI blog</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page">9</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b61">
	<analytic>
		<title level="a" type="main">Using BERT to retrieve relevant and argumentative sentence pairs</title>
		<author>
			<persName><forename type="first">P</forename><surname>Sülzle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Wenzlitschke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b62">
	<analytic>
		<title level="a" type="main">SE-UPD@CLEF: Team INTSEG on argument retrieval for controversial questions</title>
		<author>
			<persName><forename type="first">S</forename><surname>Bahrami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">P</forename><surname>Goli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pasin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rajkumari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Sohail</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tahan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b63">
	<analytic>
		<title level="a" type="main">Team Bruce Banner at Touché 2022: Argument retrieval for controversial questions</title>
		<author>
			<persName><forename type="first">B</forename><surname>Moreira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Cardoso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Martins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Goularte</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b64">
	<analytic>
		<title level="a" type="main">Pyserini: A python toolkit for reproducible information retrieval research with sparse and dense representations</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pradeep</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nogueira</surname></persName>
		</author>
		<idno type="DOI">10.1145/3404835.3463238</idno>
		<ptr target="https://doi.org/10.1145/3404835.3463238" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021</title>
				<meeting>the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2356" to="2362" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b65">
	<analytic>
		<title level="a" type="main">SEUPD@CLEF: Team 6musk on argument retrieval for controversial questions by using pairs selection and query expansion</title>
		<author>
			<persName><forename type="first">L</forename><surname>Cappellotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lando</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lupu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mariotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rosalen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b66">
	<analytic>
		<title level="a" type="main">An algorithm for suffix stripping</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F</forename><surname>Porter</surname></persName>
		</author>
		<idno type="DOI">10.1108/eb046814</idno>
		<ptr target="https://doi.org/10.1108/eb046814" />
	</analytic>
	<monogr>
		<title level="j">Program</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="130" to="137" />
			<date type="published" when="1980">1980</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b67">
	<analytic>
		<title level="a" type="main">Viewing morphology as an inference process</title>
		<author>
			<persName><forename type="first">R</forename><surname>Krovetz</surname></persName>
		</author>
		<idno type="DOI">10.1145/160688.160718</idno>
		<ptr target="https://doi.org/10.1145/160688.160718" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR &apos;93</title>
				<meeting>the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR &apos;93<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="1993">1993</date>
			<biblScope unit="page" from="191" to="202" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b68">
	<analytic>
		<title level="a" type="main">RCV1: A new benchmark collection for text categorization research</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">G</forename><surname>Rose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Li</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/volume5/lewis04a/lewis04a.pdf" />
	</analytic>
	<monogr>
		<title level="j">J. Mach. Learn. Res</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="361" to="397" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b69">
	<analytic>
		<title level="a" type="main">WordNet: A lexical database for English</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Miller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="39" to="41" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b70">
	<analytic>
		<title level="a" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/1301.3781" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st International Conference on Learning Representations, ICLR 2013</title>
				<meeting>the 1st International Conference on Learning Representations, ICLR 2013</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b71">
	<analytic>
		<title level="a" type="main">SEUPD@CLEF: Team Gamora on argument retrieval for controversial questions</title>
		<author>
			<persName><forename type="first">A</forename><surname>Benetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Togni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Foti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lacini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Matteazzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sgarbossa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b72">
	<analytic>
		<title level="a" type="main">GloVe: Global vectors for word representation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<idno type="DOI">10.3115/v1/d14-1162</idno>
		<ptr target="https://doi.org/10.3115/v1/d14-1162" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Moschitti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Pang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</editor>
		<meeting>the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014<address><addrLine>Doha, Qatar</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">October 25-29, 2014</date>
			<biblScope unit="page" from="1532" to="1543" />
		</imprint>
	</monogr>
	<note>A meeting of SIGDAT, a Special Interest Group of the ACL</note>
</biblStruct>

<biblStruct xml:id="b73">
	<analytic>
		<title level="a" type="main">SEUPD@CLEF: Team Lgtm on argument retrieval for controversial questions</title>
		<author>
			<persName><forename type="first">M</forename><surname>Barusco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">D</forename><surname>Fiume</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Forzan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Peloso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rizzetto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Soleymani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b74">
	<analytic>
		<title level="a" type="main">How effective is suffixing?</title>
		<author>
			<persName><forename type="first">D</forename><surname>Harman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Society for Information Science</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="page" from="7" to="15" />
			<date type="published" when="1991">1991</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b75">
	<analytic>
		<title level="a" type="main">The Stanford CoreNLP natural language processing toolkit</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Surdeanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Finkel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bethard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mcclosky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations</title>
				<meeting>the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="55" to="60" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b76">
	<analytic>
		<title level="a" type="main">Automatic keyword extraction from individual documents</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Engel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Cramer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Cowley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Text Mining: Applications and Theory</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="10" to="1002" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b77">
	<analytic>
		<title level="a" type="main">VADER: A parsimonious rule-based model for sentiment analysis of social media text</title>
		<author>
			<persName><forename type="first">C</forename><surname>Hutto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gilbert</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the international AAAI conference on web and social media</title>
				<meeting>the international AAAI conference on web and social media</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="216" to="225" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b78">
	<monogr>
		<title level="m" type="main">Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease formula) for Navy enlisted personnel</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Kincaid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">P</forename><surname>Fishburne</surname><genName>Jr</genName></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">L</forename><surname>Rogers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">S</forename><surname>Chissom</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1975">1975</date>
		</imprint>
		<respStmt>
			<orgName>Naval Technical Training Command Millington TN Research Branch</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b79">
	<analytic>
		<title level="a" type="main">SEUPD@CLEF: SPAM on argument retrieval for controversial questions</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Ebrahimi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Crivellari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mehrbanou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ashok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b80">
	<analytic>
		<title level="a" type="main">Similar but different: Simple re-ranking approaches for argument retrieval</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wuerf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b81">
	<analytic>
		<title level="a" type="main">The use of MMR, diversity-based reranking for reordering documents and producing summaries</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">G</forename><surname>Carbonell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Goldstein</surname></persName>
		</author>
		<idno type="DOI">10.1145/290941.291025</idno>
		<ptr target="https://doi.org/10.1145/290941.291025" />
	</analytic>
	<monogr>
		<title level="m">SIGIR &apos;98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</title>
				<editor>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Moffat</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Van Rijsbergen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Wilkinson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Zobel</surname></persName>
		</editor>
		<meeting><address><addrLine>Melbourne, Australia</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="1998">August 24-28, 1998</date>
			<biblScope unit="page" from="335" to="336" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b82">
	<analytic>
		<title level="a" type="main">From word embeddings to document distances</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Kusner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">I</forename><surname>Kolkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<ptr target="http://proceedings.mlr.press/v37/kusnerb15.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 32nd International Conference on Machine Learning, ICML 2015</title>
				<editor>
			<persName><forename type="first">F</forename><forename type="middle">R</forename><surname>Bach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</editor>
		<meeting>the 32nd International Conference on Machine Learning, ICML 2015<address><addrLine>Lille, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015-07-11">6-11 July 2015</date>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="957" to="966" />
		</imprint>
	</monogr>
	<note>JMLR Workshop and Conference Proceedings</note>
</biblStruct>

<biblStruct xml:id="b83">
	<analytic>
		<title level="a" type="main">Finding pairs of argumentative sentences using embeddings</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">V</forename><surname>Ta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Reiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Von Detten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Stöhr</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b84">
	<analytic>
		<title level="a" type="main">Two-stage retrieval for pairs of argumentative sentences</title>
		<author>
			<persName><forename type="first">S</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Probst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bartelt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hinz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b85">
	<analytic>
		<title level="a" type="main">A study of smoothing methods for language models applied to ad hoc information retrieval</title>
		<author>
			<persName><forename type="first">C</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Lafferty</surname></persName>
		</author>
		<idno type="DOI">10.1145/383952.384019</idno>
		<ptr target="https://doi.org/10.1145/383952.384019" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 24th International Conference on Research and Development in Information Retrieval, SIGIR 2001</title>
				<meeting>the 24th International Conference on Research and Development in Information Retrieval, SIGIR 2001</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="334" to="342" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b86">
	<analytic>
		<title level="a" type="main">Frequentist and Bayesian approach to information retrieval</title>
		<author>
			<persName><forename type="first">G</forename><surname>Amati</surname></persName>
		</author>
		<idno type="DOI">10.1007/11735106_3</idno>
		<ptr target="https://doi.org/10.1007/11735106_3" />
	</analytic>
	<monogr>
		<title level="m">Advances in Information Retrieval, 28th European Conference on IR Research, ECIR 2006</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">M</forename><surname>Lalmas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Macfarlane</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Rüger</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Tombros</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Tsikrika</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Yavlinsky</surname></persName>
		</editor>
		<meeting><address><addrLine>London, UK</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2006">April 10-12, 2006</date>
			<biblScope unit="volume">3936</biblScope>
			<biblScope unit="page" from="13" to="24" />
		</imprint>
	</monogr>
	<note>Proceedings</note>
</biblStruct>

<biblStruct xml:id="b87">
	<analytic>
		<title level="a" type="main">Declarative experimentation in information retrieval using PyTerrier</title>
		<author>
			<persName><forename type="first">C</forename><surname>Macdonald</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tonellotto</surname></persName>
		</author>
		<idno type="DOI">10.1145/3409256.3409829</idno>
		<ptr target="https://doi.org/10.1145/3409256.3409829" />
	</analytic>
	<monogr>
		<title level="m">ICTIR &apos;20: The 2020 ACM SIGIR International Conference on the Theory of Information Retrieval, Virtual Event</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Balog</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Setty</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Lioma</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Berberich</surname></persName>
		</editor>
		<meeting><address><addrLine>Norway</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2020">September 14-17, 2020</date>
			<biblScope unit="page" from="161" to="168" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b88">
	<analytic>
		<title level="a" type="main">Efficient pairwise annotation of argument quality</title>
		<author>
			<persName><forename type="first">L</forename><surname>Gienapp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/2020.acl-main.511/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5772" to="5781" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b89">
	<analytic>
		<title level="a" type="main">Probabilistic models of information retrieval based on measuring the divergence from randomness</title>
		<author>
			<persName><forename type="first">G</forename><surname>Amati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Van Rijsbergen</surname></persName>
		</author>
		<idno type="DOI">10.1145/582415.582416</idno>
		<ptr target="http://doi.acm.org/10.1145/582415.582416" />
	</analytic>
	<monogr>
		<title level="j">ACM Trans. Inf. Syst</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="357" to="389" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b90">
	<analytic>
		<title level="a" type="main">Passage retrieval revisited</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kaszkiel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zobel</surname></persName>
		</author>
		<idno type="DOI">10.1145/258525.258561</idno>
		<ptr target="https://doi.org/10.1145/258525.258561" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1997</title>
				<editor>
			<persName><forename type="first">N</forename><forename type="middle">J</forename><surname>Belkin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>Narasimhalu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Willett</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><forename type="middle">R</forename><surname>Hersh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Can</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Voorhees</surname></persName>
		</editor>
		<meeting>the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1997<address><addrLine>Philadelphia, PA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="1997">July 27-31, 1997</date>
			<biblScope unit="page" from="178" to="185" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b91">
	<analytic>
		<title level="a" type="main">CopyCat: Near-duplicates within and between the ClueWeb and the Common Crawl</title>
		<author>
			<persName><forename type="first">M</forename><surname>Fröbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bevendorff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gienapp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Völske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<idno type="DOI">10.1145/3404835.3463246</idno>
		<ptr target="https://dl.acm.org/doi/10.1145/3404835.3463246" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 44th International ACM Conference on Research and Development in Information Retrieval, SIGIR 2021</title>
				<meeting>the 44th International ACM Conference on Research and Development in Information Retrieval, SIGIR 2021</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2398" to="2404" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b92">
	<analytic>
		<title level="a" type="main">Sampling bias due to near-duplicates in learning to rank</title>
		<author>
			<persName><forename type="first">M</forename><surname>Fröbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bevendorff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Reimer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<idno type="DOI">10.1145/3397271.3401212</idno>
		<ptr target="https://dl.acm.org/doi/10.1145/3397271.3401212" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 43rd International ACM Conference on Research and Development in Information Retrieval, SIGIR 2020</title>
				<meeting>the 43rd International ACM Conference on Research and Development in Information Retrieval, SIGIR 2020</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1997" to="2000" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b93">
	<analytic>
		<title level="a" type="main">The effect of content-equivalent near-duplicates on the evaluation of search engines</title>
		<author>
			<persName><forename type="first">M</forename><surname>Fröbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bittner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<ptr target="https://link.springer.com/chapter/10.1007%2F978-3-030-45442-5_2" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 42nd European Conference on IR Research (ECIR 2020)</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<meeting>the 42nd European Conference on IR Research (ECIR 2020)</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">12036</biblScope>
			<biblScope unit="page" from="12" to="19" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b94">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Nogueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf" />
		<title level="m">From doc2query to docTTTTTquery</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note>Online preprint</note>
</biblStruct>

<biblStruct xml:id="b95">
	<analytic>
		<title level="a" type="main">MS MARCO: A human generated MAchine Reading COmprehension dataset</title>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rosenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tiwary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Deng</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches 2016 at NIPS</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches 2016 at NIPS</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">1773</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b96">
	<analytic>
		<title level="a" type="main">ALBERT: A lite BERT for self-supervised learning of language representations</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Lan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Goodman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gimpel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Soricut</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=H1eA7AEtvS" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 8th International Conference on Learning Representations, ICLR 2020</title>
				<meeting>the 8th International Conference on Learning Representations, ICLR 2020<address><addrLine>OpenReview</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b97">
	<analytic>
		<title level="a" type="main">TARGER: Neural argument mining at your fingertips</title>
		<author>
			<persName><forename type="first">A</forename><surname>Chernodub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Oliynyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Heidenreich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bondarenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Biemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Panchenko</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/p19-3031</idno>
		<ptr target="https://doi.org/10.18653/v1/p19-3031" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019</title>
				<meeting>the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="195" to="200" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b98">
	<analytic>
		<title level="a" type="main">ColBERT: Efficient and effective passage search via contextualized late interaction over BERT</title>
		<author>
			<persName><forename type="first">O</forename><surname>Khattab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zaharia</surname></persName>
		</author>
		<idno type="DOI">10.1145/3397271.3401075</idno>
		<ptr target="https://doi.org/10.1145/3397271.3401075" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Chang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Cheng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Kamps</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Murdock</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Wen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</editor>
		<meeting>the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="39" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b99">
	<monogr>
		<title level="m" type="main">The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models</title>
		<author>
			<persName><forename type="first">R</forename><surname>Pradeep</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nogueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<idno>CoRR abs/2101.05667</idno>
		<ptr target="https://arxiv.org/abs/2101.05667" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b100">
	<analytic>
		<title level="a" type="main">LeviRank: Limited query expansion with voting integration for document retrieval and ranking</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Golchha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Juntunen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Coajă</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Elzamarany</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Hung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">P</forename><surname>Ponzetto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b101">
	<monogr>
		<title level="m" type="main">Natural Language Processing with Python</title>
		<author>
			<persName><forename type="first">S</forename><surname>Bird</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Klein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Loper</surname></persName>
		</author>
		<ptr target="http://www.oreilly.de/catalog/9780596516499/index.html" />
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>O&apos;Reilly</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b102">
	<monogr>
		<title level="m" type="main">RoBERTa: A robustly optimized BERT pretraining approach</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno>CoRR abs/1907.11692</idno>
		<ptr target="http://arxiv.org/abs/1907.11692" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b103">
	<analytic>
		<title level="a" type="main">A broad-coverage challenge corpus for sentence understanding through inference</title>
		<author>
			<persName><forename type="first">A</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nangia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Bowman</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/n18-1101</idno>
		<ptr target="https://doi.org/10.18653/v1/n18-1101" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, Association for Computational Linguistics</title>
				<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1112" to="1122" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b104">
	<analytic>
		<title level="a" type="main">Argument retrieval for comparative questions</title>
		<author>
			<persName><forename type="first">M</forename><surname>Aba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Azra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Mohammad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Piacere</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Virginio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b105">
	<monogr>
		<title level="m" type="main">Snowball: A language for stemming algorithms</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F</forename><surname>Porter</surname></persName>
		</author>
		<ptr target="http://snowball.tartarus.org/texts/introduction.html" />
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b106">
	<monogr>
		<title level="m" type="main">Relevance feedback in information retrieval, The SMART Retrieval System: Experiments in Automatic Document Processing</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rocchio</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1971">1971</date>
			<biblScope unit="page" from="313" to="323" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b107">
	<analytic>
		<title level="a" type="main">Reciprocal rank fusion outperforms condorcet and individual rank learning methods</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L A</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Büttcher</surname></persName>
		</author>
		<idno type="DOI">10.1145/1571941.1572114</idno>
		<ptr target="https://doi.org/10.1145/1571941.1572114" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009</title>
				<meeting>the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="758" to="759" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b108">
	<analytic>
		<title level="a" type="main">Retrieving comparative arguments using deep language models</title>
		<author>
			<persName><forename type="first">V</forename><surname>Chekalina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Panchenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b109">
	<analytic>
		<title level="a" type="main">SEUPD@CLEF: Team Hextech on argument retrieval for comparative questions. The importance of adjectives in documents quality evaluation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Chimetto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Peressoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sabbatini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Tommasin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Varotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zanardelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b110">
	<analytic>
		<title level="a" type="main">Development of a stemming algorithm</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Lovins</surname></persName>
		</author>
		<ptr target="http://www.mt-archive.info/MT-1968-Lovins.pdf" />
	</analytic>
	<monogr>
		<title level="j">Mech. Transl. Comput. Linguistics</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="22" to="31" />
			<date type="published" when="1968">1968</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b111">
	<analytic>
		<title level="a" type="main">Stacked model based argument extraction and stance detection using embedded LSTM model</title>
		<author>
			<persName><forename type="first">P</forename><surname>Rajula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Hung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">P</forename><surname>Ponzetto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b112">
	<monogr>
		<title level="m" type="main">DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</title>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Debut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chaumond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<idno>CoRR abs/1910.01108</idno>
		<ptr target="http://arxiv.org/abs/1910.01108" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b113">
	<analytic>
		<title level="a" type="main">DistilBERT-based argumentation retrieval for answering comparative questions</title>
		<author>
			<persName><forename type="first">A</forename><surname>Alhamzeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bouhaouel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Egyed-Zsigmond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mitrovic</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-2936/paper-209.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">2936</biblScope>
			<biblScope unit="page" from="2319" to="2330" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b114">
	<analytic>
		<title level="a" type="main">Grimjack at Touché 2022: Axiomatic re-ranking and query reformulation</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Reimer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Huck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bondarenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b115">
	<analytic>
		<title level="a" type="main">Webis at TREC 2019: Decision track</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bondarenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fröbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kasturia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Völske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th International Text Retrieval Conference, TREC 2019</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Voorhees</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Ellis</surname></persName>
		</editor>
		<meeting>the 28th International Text Retrieval Conference, TREC 2019</meeting>
		<imprint>
			<publisher>NIST</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b116">
	<analytic>
		<title level="a" type="main">Webis at TREC 2020: Health Misinformation track</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bevendorff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bondarenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fröbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Günther</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Völske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 29th International Text Retrieval Conference, TREC 2020</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Voorhees</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Ellis</surname></persName>
		</editor>
		<meeting>the 29th International Text Retrieval Conference, TREC 2020</meeting>
		<imprint>
			<publisher>NIST</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b117">
	<analytic>
		<title level="a" type="main">Quality-aware argument re-ranking for comparative questions</title>
		<author>
			<persName><forename type="first">P</forename><surname>Rösner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Arnhold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Xylander</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b118">
	<analytic>
		<title level="a" type="main">LexRank: Graph-based lexical centrality as salience in text summarization</title>
		<author>
			<persName><forename type="first">G</forename><surname>Erkan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Radev</surname></persName>
		</author>
		<idno type="DOI">10.1613/jair.1523</idno>
		<ptr target="https://doi.org/10.1613/jair.1523" />
	</analytic>
	<monogr>
		<title level="j">J. Artif. Intell. Res</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="page" from="457" to="479" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b119">
	<analytic>
		<title level="a" type="main">Efficient and effective spam filtering and re-ranking for large web datasets</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Smucker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L A</forename><surname>Clarke</surname></persName>
		</author>
		<idno type="DOI">10.1007/s10791-011-9162-z</idno>
		<ptr target="https://doi.org/10.1007/s10791-011-9162-z" />
	</analytic>
	<monogr>
		<title level="j">Inf. Retr</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="441" to="465" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b120">
	<analytic>
		<title level="a" type="main">Latent dirichlet allocation</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">I</forename><surname>Jordan</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v3/blei03a.html" />
	</analytic>
	<monogr>
		<title level="j">J. Mach. Learn. Res</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="993" to="1022" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b121">
	<analytic>
		<title level="a" type="main">Random decision forests</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">K</forename><surname>Ho</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICDAR.1995.598994</idno>
		<ptr target="https://doi.org/10.1109/ICDAR.1995.598994" />
	</analytic>
	<monogr>
		<title level="m">Third International Conference on Document Analysis and Recognition, ICDAR 1995</title>
				<meeting><address><addrLine>Montreal, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="1995">August 14-15, 1995</date>
			<biblScope unit="page" from="278" to="282" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b122">
	<analytic>
		<title level="a" type="main">Learning whom to trust with MACE</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hovy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Berg-Kirkpatrick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hovy</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/N13-1132" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HTL 2013), Association for Computational Linguistics</title>
				<meeting>the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HTL 2013), Association for Computational Linguistics<address><addrLine>Atlanta, Georgia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1120" to="1130" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b123">
	<analytic>
		<title level="a" type="main">Aramis at Touché 2022: Argument detection in pictures using machine learning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Braker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Heinemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schreieder</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b124">
	<analytic>
		<title level="a" type="main">Distinguishing cartoons images from real-life images</title>
		<author>
			<persName><forename type="first">M</forename><surname>Zaid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>George</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Al-Khafaji</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Advanced Research in Computer Science and Software Engineering</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="91" to="95" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b125">
	<analytic>
		<title level="a" type="main">Boromir at Touché 2022: Combining natural language processing and machine learning techniques for image retrieval for arguments</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brummerloh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Carnot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lange</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pfänder</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2022 Evaluation Labs, CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b126">
	<analytic>
		<title level="a" type="main">A new ANEW: evaluation of a word list for sentiment analysis in microblogs</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">Å</forename><surname>Nielsen</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-718/paper_16.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ESWC2011 Workshop on &apos;Making Sense of Microposts&apos;</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">M</forename><surname>Rowe</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Stankovic</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Dadzie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hardey</surname></persName>
		</editor>
		<meeting>the ESWC2011 Workshop on &apos;Making Sense of Microposts&apos;</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="volume">718</biblScope>
			<biblScope unit="page" from="93" to="98" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
