=Paper=
{{Paper
|id=Vol-3180/paper-265
|storemode=property
|title=Similar but Different: Simple Re-ranking Approaches for Argument Retrieval
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-265.pdf
|volume=Vol-3180
|authors=Jerome Würf
|dblpUrl=https://dblp.org/rec/conf/clef/Wurf22
}}
==Similar but Different: Simple Re-ranking Approaches for Argument Retrieval==
Notebook of Team Hit-Girl for the Touché Lab on Argument Retrieval at CLEF 2022
Jerome Würf, Leipzig University, Augustusplatz 10, 04109 Leipzig, Germany

Abstract

This work examines simple re-ranking approaches using the preprocessed args.me corpus to contribute to the Touché 2022 Argument Retrieval Task. The proposed retrieval system relies on an initial retrieval using a semantic search on a sentence level and takes advantage of simple heuristics. Our re-ranking approaches incorporate maximal marginal relevance, word mover's distance, and a novel approach based on fuzzy matching of part-of-speech tags that we call structural distance. Further, we explore the applicability of a graph-based re-ranking approach. The results show that the proposed re-ranking approaches could beat our baseline. For relevance, our re-ranking using structural distance performs best, while for quality, the one using the word mover's distance achieves the highest score.

Keywords: information retrieval, argument retrieval, semantic search, re-ranking, Touché 2022

1. Introduction

The waves of protests in response to the pandemic restrictions of last winter seem to highlight a problem in the current culture of discussion. Despite increased exposure to facts on controversial topics in our daily lives, we fail to present the gained knowledge in a way that enables debates and supports individuals' opinion formation. Regarding COVID-19, it has been shown that people exposed to misinformation, biased media, and conspiracy theories have lower trust in democratic institutions [1]. This situation makes it urgent for societies to confront misinformed individuals with reasonable arguments. Besides COVID-19, web resources like blogs and news sites address many other topics with a similar, potentially harmful impact. This development motivates our research on the automatic retrieval of reasonable arguments.

This work describes the submission of team Hit-Girl (https://en.wikipedia.org/wiki/Hit-Girl) for Task 1 of Touché 2022 [2]. The task asks participants to create an argument retrieval system for a given corpus to support opinion formation on controversial societal topics. In this year's version of the first task, the requirements for the final systems differ from the previous years, as participants are asked to retrieve argumentative sentence pairs instead of whole arguments for a given topic. A sentence pair is reasonable if the retrieved sentences are topic-relevant and of high quality. The quality of arguments is defined by (1) the argumentativeness of each sentence, (2) the coherence between the sentences, and (3) whether, together, the sentences of the pair form a summary of their originating arguments [2].

Our proposed system consists of three main components: indexing, initial retrieval, and re-ranking. The system's source code is publicly available (https://git.informatik.uni-leipzig.de/hit-girl/code). Before indexing, sentences of the provided preprocessed args.me corpus [3] are transformed into vector embeddings. Sentences and vector embeddings are stored in two indices: one holds only premises, and the other holds only conclusions.
We conduct a nearest neighbor search in the embedding space at retrieval time. Initially, we rank according to the cosine similarity between the query embedding and the embeddings in the respective index. This approach should maximize the semantic similarity between sentences, resulting in topic-relevant sentences. In the following, we refer to this as semantic search. Finally, we compare multiple re-ranking approaches that aim to balance relevance and diversification of query results by assessing differences between a query and the retrieved sentences.

Having outlined our initial motivation and a rough overview of how we approach the given task, we pose the following research question: Do simple, argument-quality-agnostic re-ranking approaches improve argument quality compared to an initial semantic search?

To answer our research question, we conducted experiments with three different re-ranking approaches utilizing maximal marginal relevance (MMR), structural distance (SD), and word mover's distance (WMD). In comparison to the baseline, two re-ranking approaches could increase argument relevance. WMD results in a better quality score than the baseline, and all re-ranking approaches show better sentence coherence than the baseline. Further, we analyze the challenges of implementing a graph-based argument re-ranking approach. Section 2 introduces the related work. Section 3 describes our system and re-ranking approaches. Section 4 presents the evaluation of our experiments.

2. Related Work

This section introduces the challenge of argument retrieval and describes existing re-ranking approaches. We pick up on the shortcomings of previous studies to justify the design of our system.

2.1. Challenges in argument retrieval

Search engines for argument retrieval on controversial topics aim to quickly and comprehensively provide users with supportive and opposing arguments. Argument search is a relatively new field of research. It unites challenges of natural language processing and information retrieval while opening up a broad range of research opportunities for computational argumentation [4]. In contrast to relevance-oriented search engines, systems for argument retrieval additionally need to focus on:

- incorporating the quality of the arguments to check for their validity
- providing an overview of arguments with different stances instead of a single best answer
- assessing and reflecting the connections between arguments in the final ranking

2.2. Existing methods in argument retrieval

ArgumenText [5] and args [4] are important pioneers offering diverse technical approaches to the outlined challenges of argument retrieval. ArgumenText [5] was one of the first systems ingesting heterogeneous Web documents, identifying arguments in topic-relevant documents, and labeling the identified arguments with a "pro" or "con" stance. The identification of arguments relies on an attention-based neural network, and stance recognition utilizes a BiLSTM model. Both models were trained on a dataset containing 49 topics with 600 sentences each, labeled as "pro", "con", or not an argument. The authors compared their system's performance to an expert-curated list of arguments from a specific online debate portal (https://ProCon.org) and reported that, on three selected topics, the retrieved arguments matched 89% of those on the expert-curated list.
Further, they pointed out that 12% of the arguments identified by their approach were not contained in the expert-curated list. ArgumenText [5] differs from our system, as we use a preprocessed dataset that already contains arguments split into their constituent sentences. Further, these sentences are already labeled with a stance. Therefore, our system only relies on initial retrieval and re-ranking approaches.

Args [4] is a prototype argument retrieval system using a novel argument search framework and a newly crawled Web-based corpus [3]. The framework incorporates a common argument model. In this model, one argument consists of a claim/conclusion, zero or more premises, and an argument's context, which provides the full text in which a specific argument occurred. In general, the framework splits into an indexing process and a retrieval process. The indexing process contains the acquisition of documents, argument mining, an assessment, and indexing. For the initial acquisition, the authors crawl the args.me [3] corpus. The crawl focuses on five different debate portals and includes 34,784 debates containing 291,440 arguments that were finally parsed into 329,791 argument units. Argument mining and parsing into the common argument model rely on Apache UIMA (https://uima.apache.org). The final indexing is realized with Apache Lucene. In the retrieval process, the args prototype performs an initial retrieval for a given query, relying on an exact string match between query terms and terms in an indexed argument, and conducts a ranking of relevant arguments using a BM25 model. To be more specific, a BM25F model was used to weigh the individual components of the common argument model. The authors performed a quantitative analysis using controversial topics from Wikipedia as queries. The scores were reported on the system's coverage for logical combinations of query terms and phrase queries, and on the three components of the proposed common argument model: conclusions, arguments, and the argument's context. Finally, the system achieved a good initial coverage ranging from 41.6% to 84.6% for all query types on the conclusions and a coverage of 77.6% on phrase queries for whole arguments. The results indicate that a retrieval model with a higher weight on conclusions yields arguments of higher relevance.

Our system uses a preprocessed version of the args.me [3] corpus. To be specific, our system indexes sentences that were gained from the argument mining and assessment steps of the args search engine. Like args, our system's initial retrieval and re-ranking approaches do not rely on identifying argumentative structures within the indexed argument units. In contrast to args, we use two indices, one for conclusions and one for premises, instead of indexing whole arguments at once. Motivated by the finding of the args search engine that conclusions should receive a higher weight, our system queries our conclusion index first and uses the retrieved conclusions to query the premise index. Furthermore, our system enforces a minimum number of tokens in a retrieved conclusion compared to a query. This constraint is also motivated by the expectation of args' authors "that the most relevant arguments need some space to lay out their reasoning" [4].

Previous years of Touché showed substantial improvements in retrieval performance.
In the first year, multiple submissions indicated that the DirichletLM [6] retrieval model is a strong baseline for the initial retrieval of argumentative text [7]. Additionally, query expansion mechanisms were deployed to increase recall. Submissions for the second round of Touché indicated that argument-aware re-ranking approaches using fine-tuned language models improved previous years' results. Moreover, approaches focused on parameter tuning of pipelines proposed in the previous year, using existing relevance judgments [8]. Up to now, only a minority of Touché's submissions [9, 10] have leveraged embeddings for an initial retrieval, which motivates us to gain a deeper understanding of this approach. Motivated by the promising results of query expansion in last year's submissions [11, 12, 13], our system mimics a query expansion by first retrieving conclusions for an initial controversial topic and then using these conclusions to query an index holding the premises. Finally, our re-ranking approaches distinguish themselves from existing ones, as we do not rely on argument-specific domain features or machine learning methods.

3. Methodological Approach

The architecture of our retrieval system (Figure 1) consists of an indexing, a retrieval, and a re-ranking module. The system relies on two Elasticsearch (https://github.com/elastic/elasticsearch) indices, one for conclusions and one for premises. Initially, our system uses the preprocessed args.me [3] corpus, which holds arguments divided into their constituent premise and conclusion sentences. The sentences are transformed into vector embeddings (Section 3.1). Premises and conclusions are indexed into the respective indices with their vector embeddings. While indexing, the standard tokenization pipeline of Elasticsearch is applied to save the number of tokens in a sentence as further metadata.

Figure 1: System architecture. Besides the preprocessing, our system splits into three parts: indexing of sentences and embeddings, an initial retrieval on different controversial topics, and a re-ranking whose strategy is determined by a given configuration.

The retrieval module generates an initial ranking for premise and conclusion pairs. First, it queries the conclusion index for a given controversial topic. Next, each conclusion serves as a query for the premise index. The initial retrieval scores are based on the cosine similarity between the vector embeddings of a query and the indexed sentences, thus mimicking a nearest neighbor search in the embedding space. Additionally, we introduce the hard constraint that a retrieved sentence must have at least 1.75 times the number of tokens of the query. The exact value was chosen by convention. In the following, we will refer to this constraint as the token factor. The usage of a token factor is motivated by previous findings suggesting that argumentative sentences of high quality are longer than non-argumentative ones [4]. By convention, we retrieve 100 conclusions and 50 premises per conclusion. A primary motivation behind this two-step retrieval is an expected increase in premise recall, as we query the premise index multiple times using different conclusions.
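As an illustration of this two-step retrieval, the following sketch embeds the topic, ranks conclusions by cosine similarity through an Elasticsearch script_score query, enforces the token factor, and then repeats the search on the premise index once per retrieved conclusion. Index names, field names, the query shape, and the whitespace-based token count are assumptions made for illustration, not the paper's actual code.

```python
# Minimal sketch of the two-step retrieval (topic -> conclusions -> premises).
# Index/field names and the client-side token-factor filter are illustrative assumptions.
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
TOKEN_FACTOR = 1.75

def semantic_search(index: str, query_text: str, size: int) -> list[dict]:
    """Rank indexed sentences by cosine similarity to the query embedding."""
    query_vector = encoder.encode(query_text).tolist()
    script_query = {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                # cosineSimilarity is shifted by 1.0 because Elasticsearch scores must be non-negative.
                "source": "cosineSimilarity(params.q, 'embedding') + 1.0",
                "params": {"q": query_vector},
            },
        }
    }
    hits = es.search(index=index, query=script_query, size=size)["hits"]["hits"]
    # Hard constraint (token factor): a retrieved sentence needs at least 1.75 times
    # the query's tokens; whitespace splitting approximates the stored token count.
    min_tokens = TOKEN_FACTOR * len(query_text.split())
    return [h["_source"] for h in hits if h["_source"]["token_count"] >= min_tokens]

def retrieve(topic: str) -> list[tuple[dict, list[dict]]]:
    """Retrieve 100 conclusions for a topic and 50 premises per conclusion."""
    conclusions = semantic_search("conclusions", topic, size=100)
    return [(c, semantic_search("premises", c["text"], size=50)) for c in conclusions]
```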
Finally, the re-ranking module scores conclusions and premises separately using three different methods, which are explained in Section 3.2. In general, these methods should improve the ranking with respect to the argumentative quality of the retrieved sentences by calculating new ranking scores between the query and the initially ranked sentences. Lastly, our system generates a text file in the standard TREC format. When writing the output file, we enforce on a topic level that there are no duplicates among the retrieved premises and that premises match the stance of their conclusion.

3.1. Preprocessing

The organizers of the shared task provide the preprocessed args.me corpus [3] that contains the constituent sentences of each argument of the original args.me corpus. Further, it contains context metadata for each argument and the stance of each sentence. Initially, we transform the provided preprocessed corpus into a structured parquet file. One row of this flat file corresponds to one sentence and holds the argument ID, the sentence number (the sentence's index in the array of premises of an argument within the preprocessed args.me corpus), the stance towards a topic, the sentence text, and the sentence type, either conclusion or premise. The flat file contains 6,123,792 sentences that split into 338,595 conclusions and 5,785,197 premises. The original argument model of args.me [3] associates one conclusion with many premises, explaining the difference in cardinality between conclusions and premises. Our approach breaks this association and combines the premises and conclusions of different arguments.

We deduplicate the sentences using an exact string match. For the conclusions, we count 328,474 duplicates with 54,512 unique ones, which results, together with the non-duplicated ones, in a total of 64,633 conclusions to index. For the premises, we count 770,876 duplicates with 273,593 unique ones, which results, together with the non-duplicated ones, in 5,626,509 premises to index. The high number of duplicated conclusions arises from the parsing of debate platforms: conclusions are often simply the headline of a post on a controversial topic, and a single post contains multiple arguments. The duplicated premises arise from direct citations between different posts. As a final preprocessing step, each sentence is encoded into a vector embedding via an out-of-the-box MiniLM [14] language model (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) utilizing the sentence-transformers library [15].
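The following sketch illustrates this preprocessing and indexing step. The column names, the Elasticsearch mapping, the whitespace token count, and the file path are assumptions made for illustration; the paper only states that sentences are deduplicated by exact string match, encoded with the MiniLM model, and indexed together with their embeddings and token counts.

```python
# Minimal sketch: deduplicate sentences, encode them with MiniLM, and index them
# together with their embeddings. Index, field, and column names are illustrative assumptions.
import pandas as pd
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dimensional embeddings

MAPPING = {
    "properties": {
        "text": {"type": "text"},
        "token_count": {"type": "integer"},
        "embedding": {"type": "dense_vector", "dims": 384},
    }
}

def index_sentences(sentences: pd.DataFrame, index: str) -> None:
    """Index one sentence per row; the 'text' column holds the sentence string."""
    if not es.indices.exists(index=index):
        es.indices.create(index=index, mappings=MAPPING)
    # Keep only the first occurrence of each sentence (exact string match deduplication).
    unique = sentences.drop_duplicates(subset="text")
    embeddings = encoder.encode(unique["text"].tolist())
    for (_, row), vector in zip(unique.iterrows(), embeddings):
        es.index(index=index, document={
            "text": row["text"],
            "token_count": len(row["text"].split()),  # rough stand-in for the Elasticsearch token count
            "embedding": vector.tolist(),
        })

corpus = pd.read_parquet("sentences.parquet")  # hypothetical path to the flat file
index_sentences(corpus[corpus["type"] == "conclusion"], "conclusions")
index_sentences(corpus[corpus["type"] == "premise"], "premises")
```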
3.2. Re-ranking approaches

To improve the argument quality of our initial retrieval, we examine three different re-ranking approaches using existing implementations. Our re-ranking approaches do not rely on argument-specific sentence features. Due to the two-step retrieval approach of our system, re-ranking scores of conclusions and premises are calculated separately: first we re-rank the conclusions, then each set of premises retrieved for a conclusion. Each approach combines the respective re-ranking score with the initial ranking score using a weighted sum (Section 3.2.1). We expect this general approach to improve the argumentative quality of sentences by ensuring that the top results differ from the original query. Furthermore, we explore the challenges of a graph-based notion of argument relevance for re-ranking.

Maximal Marginal Relevance

The first sentence pairs in the results of our initial ranking were very similar, differing only in single words. Motivated by this observation, we implement maximal marginal relevance (MMR) [16]. MMR linearly combines the query relevance and the information novelty of a document within a ranking. The information novelty factor ensures that the assessed score of a document incorporates its dissimilarity to the previously chosen ones. The tradeoff between query relevance and information novelty is controlled by a parameter λ. For our experiments, we assess different λ values. Our system calculates an MMR score for each sentence in the set of conclusions and for each sentence in the individual sets of premises separately. The MMR for the conclusions combines the query relevance between a specific conclusion and the given controversial topic with the information novelty of that conclusion within the set of already re-ranked ones. The same procedure is applied to every premise of the individual premise sets, where the respective conclusion serves as the query.

Structural distance

As a second re-ranking approach, we propose a re-ranking based on the structural distance (SD) between the query and the retrieved sentences. The SD should impose a penalty on retrieved sentences that merely rephrase the search query with synonyms, thus boosting the scores of sentences that have a different structure than the query. We define the structure of a sentence as a list of part-of-speech tags (https://universaldependencies.org/u/pos/) generated by a pre-trained pipeline (https://github.com/explosion/spacy-models/releases/tag/en_core_web_md-3.3.0) using the spaCy NLP library (https://spacy.io/). The calculation of SD (Equation 1) is closely related to the Jaro similarity. Using the part-of-speech tags of a query $q$ and a retrieved sentence $s$, we calculate the Jaro similarity $\mathrm{sim}(q, s)$ on a tag level instead of a character level. The standard Jaro similarity uses the length of each string, the number of matching characters, and the number of transposed characters between both within a specific interval. We adapt this to the total numbers of part-of-speech tags of the query $|q_{pos}|$ and the sentence $|s_{pos}|$, the number of matching tags $m$, and the number of transposed tags $t$ between query and sentence. A matching or transposed tag within $q$ and $s$ counts towards $m$ or $t$ if it is within a window of $\lfloor \max(|q_{pos}|, |s_{pos}|)/2 \rfloor - 1$. This approach allows for a fuzzy matching based on the structure. For the calculation of the Jaro similarity, we use the popular textdistance package (https://github.com/life4/textdistance) and pass two lists of part-of-speech tags instead of two strings to the Jaro similarity method. Finally, we convert the obtained similarity into a distance score by subtracting it from 1, obtaining SD.

$$ SD(q, s) = 1 - \mathrm{sim}(q_{pos}, s_{pos}) \quad (1) $$

$$ \mathrm{sim}(q_{pos}, s_{pos}) = \begin{cases} 0 & \text{if } m = 0 \\ \frac{1}{3}\left(\frac{m}{|q_{pos}|} + \frac{m}{|s_{pos}|} + \frac{m - t}{m}\right) & \text{otherwise} \end{cases} $$

Word mover's distance

In contrast to the two previous re-ranking approaches, which examine whole sentence strings, the word mover's distance (WMD), proposed by Kusner et al. [17], considers the similarity of single words within two sentences to each other. For each word pair, the earth mover's distance of the corresponding words is calculated using their Word2Vec [18] embeddings. This process is formulated as a combinatorial problem that retrieves the word pairs leading to a minimal cumulative sum of distances over all constructed word pairs. Hence, it accounts for sentences with no words in common but similar meanings due to synonymy [17]. Our re-ranking leverages this behavior to rank sentences that differ from the query higher, providing a more diverse set of argumentative sentences. Our implementation uses the wmd-relax package (https://github.com/src-d/wmd-relax), as it provides an off-the-shelf spaCy hook that uses the same pre-trained pipeline as in SD. Using this hook allows for an easy integration into our re-ranking pipeline.
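The following minimal sketch shows how the MMR and SD re-ranking scores, and the weighted combination from Equation 2, could be computed. The greedy MMR formulation, the function names, and the [0, 1] scaling are our assumptions; WMD is omitted because the paper relies on the wmd-relax spaCy hook for it.

```python
# Minimal sketches of the MMR and SD re-ranking scores and the final weighted sum.
# Function and variable names are illustrative; the paper uses existing implementations.
import numpy as np
import spacy
import textdistance

nlp = spacy.load("en_core_web_md")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_order(query_vec: np.ndarray, candidate_vecs: list[np.ndarray], lam: float = 0.75) -> list[int]:
    """Greedy maximal marginal relevance: trade query relevance off against
    similarity to the already selected candidates."""
    selected, remaining = [], list(range(len(candidate_vecs)))
    while remaining:
        best = max(
            remaining,
            key=lambda i: lam * cosine(query_vec, candidate_vecs[i])
            - (1 - lam) * max((cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected), default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    return selected  # candidate indices in re-ranked order

def structural_distance(query: str, sentence: str) -> float:
    """SD(q, s) = 1 - Jaro similarity computed over part-of-speech tags instead of characters."""
    q_tags = [token.pos_ for token in nlp(query)]
    s_tags = [token.pos_ for token in nlp(sentence)]
    return 1.0 - textdistance.jaro(q_tags, s_tags)

def final_score(initial_sim: float, rerank_score: float, mu: float) -> float:
    """Weighted sum of initial cosine similarity and re-ranking score (Equation 2);
    both components are assumed to be scaled to [0, 1]."""
    return mu * initial_sim + (1.0 - mu) * rerank_score
```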
Graph-based re-ranking

Wachsmuth et al. [19] have proposed a graph-based approach to measure relevance based on structural connections between argument units. Their hypothesis states that the content of arguments does not determine their relevance, the reasoning being that the content of an argument is subjective. Their proposed approach instead infers the relevance of an argument from the number of other arguments that reuse its conclusion as a premise. Further, the approach incorporates the intrinsic relevance of those arguments in a recursive fashion. A recursive analysis of links between argument units allows for an objective assessment of argument relevance, as no human judgment is needed. The authors adopt vital components of the PageRank algorithm [20]. They use a framework of argument graphs, in which the arguments represent nodes. Arguments are split into premises and conclusions as argument units. Reusing a conclusion as a premise in another argument determines an edge between two argument nodes. An edge is constructed based on an interpretation function; the authors use an exact string match as their interpretation function.

Using the nested structure (multiple conclusions per topic and multiple premises per conclusion) provided by our initial retrieval step, we model one argument graph for each topic using the networkX [21] graph processing library. For edge interpretation, we reuse the vector embeddings of the initial retrieval and calculate the cosine similarity between each premise and all the other conclusions. If an interpretation threshold of 0.99 is surpassed, we create an edge. Analyzing the threshold-surpassing similarities over all topic-based argument graphs, we observe a high skew within the similarities (Figure 2, right). The skew can be attributed to the initial retrieval, which is also based on the cosine similarity. Regarding the connectivity between arguments, the skewed distribution of cosine similarities leads to a few highly connected argument nodes and a majority of nodes with only a single connection to another argument (Appendix 3a). In our system's setup, applying PageRank to the arguments would therefore not lead to any meaningful re-ranking scores.

Furthermore, we investigate the WMD for graph construction. Using the WMD instead of the cosine similarity should better assess the semantic differences between two argument units. We transform the WMD into a similarity to use it as an interpretation function. We call the transformed measurement word mover's similarity (WMS). WMS is obtained by the transformation $wms(s_1, s_2) = \frac{1}{1 + wmd(s_1, s_2)}$. We assess an initial interpretation threshold of 0.2 that must be surpassed to draw an edge between two arguments.
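To make this graph construction concrete, the sketch below (our own illustration under assumptions, not the authors' implementation) builds a per-topic argument graph with networkX, rescales a word mover's distance into the word mover's similarity, and draws an edge whenever a premise of one argument is sufficiently similar to the conclusion of another. The `wmd` callable, the edge direction, and the input structure are assumptions; the paper computes the WMD through the wmd-relax spaCy hook.

```python
# Minimal sketch: per-topic argument graph construction with a WMS interpretation
# function. The `wmd` callable and the argument structure are illustrative assumptions.
import itertools
import networkx as nx

def word_movers_similarity(distance: float) -> float:
    """Rescale a word mover's distance into a similarity: wms = 1 / (1 + wmd)."""
    return 1.0 / (1.0 + distance)

def build_argument_graph(arguments: dict, wmd, threshold: float = 0.2) -> nx.DiGraph:
    """arguments maps an argument id to {"conclusion": str, "premises": [str, ...]}.
    An edge b -> a is drawn if a premise of argument b reuses (is sufficiently
    similar to) the conclusion of argument a."""
    graph = nx.DiGraph()
    graph.add_nodes_from(arguments)
    for (a, arg_a), (b, arg_b) in itertools.permutations(arguments.items(), 2):
        if any(word_movers_similarity(wmd(arg_a["conclusion"], premise)) > threshold
               for premise in arg_b["premises"]):
            graph.add_edge(b, a)
    return graph

# Argument relevance scores could then be derived with PageRank, e.g.:
# relevance = nx.pagerank(build_argument_graph(arguments_for_topic, wmd))
```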
The similarity distribution over all topics differs tremendously from the distribution of cosine similarities (Figure 2, left). Nevertheless, similar to the argument graphs generated using the cosine similarity, the node degree distribution is also skewed (Appendix 4a). Next, we examined the total number of edges per topic for both interpretation functions (Appendix 3b and 4b). Due to the lower interpretation threshold of the WMS compared to the cosine-similarity-based graph construction, the total number of edges in the argument graphs is higher. Increasing the threshold would lead to topics for which no WMS surpasses the threshold, thus not generating an argument graph. To alleviate this problem, an individual threshold would have to be tuned per topic, which questions the general applicability of this re-ranking approach. Finally, some edges could be attributed to the wrong arguments due to our initial deduplication of the provided corpus. The duplicated premises in the provided corpus originate from arguments citing each other within the crawled debate platforms. Our sentence-level deduplication is based on exact string matches, and only the first occurrence of a sentence is kept, while the others are discarded. Therefore, we could not enforce that a particular sentence is linked to the argument ID of the argument in which it was originally written. Due to the challenges outlined in this section, we did not further investigate a re-ranking based on argument graphs.

Figure 2: Comparison of the distribution of similarities between word mover's similarity and cosine similarity. Word mover's similarity is gained by rescaling the word mover's distance. Similarity values were gained by combining the similarities of the constructed argument graphs over all provided sample topics. The argument graph construction using word mover's similarity used an interpretation threshold of 0.2 to create an edge between two arguments; the construction using cosine similarity used a threshold of 0.99.

3.2.1. Final Scoring

Our system calculates the final re-ranking scores for conclusions and premises separately. For WMD and SD, we assess the final score $S$ of a premise or conclusion as a weighted sum of the initial cosine similarity $I$ and the respective re-ranking score $R$ (Equation 2). MMR does not need a weighted sum, as the MMR score itself already incorporates the initial ranking information. Like λ in the MMR, μ controls the tradeoff between sentence relevance and difference to the query. A higher μ emphasizes the initial ranking score and penalizes the re-ranking score. $S$ denotes the final score between a query sentence $q$ and an indexed sentence $d$. We scale both $I$ and $R$ to the interval [0, 1]. For both SD and WMD, we examined different values of μ, relying on our qualitative assessment of the generated rankings. For our final evaluations, we set μ to 0.9 for conclusions and 0.75 for premises. These parameter configurations were determined by a heuristic assessment of the relevance and quality of the generated rankings.

$$ S(q, d) = \mu \cdot I(q, d) + (1 - \mu) \cdot R(q, d) \quad (2) $$

4. Evaluation

We performed four runs on the TIRA platform [22] to ensure the reproducibility of our results. The runs are named after the planet Jupiter and the first three Galilean moons. The evaluation is based on the judgments provided by the task organizers. The reported scores adhere to the recommendation of calculating nDCG@5.

| Approach (TIRA tag) | Relevance | Quality | Coherence |
|---|---|---|---|
| Baseline (Jupiter) | 0.560 | 0.725 | 0.330 |
| SD (Io) | 0.588 | 0.719 | 0.365 |
| MMR (Europa) | 0.546 | 0.721 | 0.349 |
| WMD (Ganymede) | 0.583 | 0.776 | 0.377 |

Table 1: Resulting mean nDCG@5 for relevance-, quality-, and coherence-based evaluation over 50 topics. Two of our re-ranking approaches could beat the baseline without any re-ranking on relevance. WMD achieves the highest quality score. All runs use a token factor of 1.75 for the initial retrieval. The runs of SD and WMD use μ = 0.9 for the re-ranking of conclusions and μ = 0.75 for the re-ranking of premises. MMR uses λ = 0.75.
Table 1 shows our relevance, quality, and sentence pair coherence results. To measure the effectiveness of our implemented re-ranking methods, we include a baseline (Jupiter) that relies on the initial retrieval based on the cosine similarities generated by Elasticsearch. Two re-ranking approaches, SD and WMD, could beat the baseline with a mean nDCG@5 of 0.588 and 0.583 for relevance. The run using SD ranks 12th for relevance among all submitted runs of the shared task. SD could not outperform the baseline regarding quality. WMD showed the best quality measurement, ranking at 8th place among all participating runs. For relevance and quality, the MMR run (Europa) did worse than the baseline, though it scored slightly better than SD on quality. Unsurprisingly, none of our runs could achieve high scores regarding the coherence between two sentences, as our retrieval system does not optimize for this criterion.

5. Conclusion

We examined whether re-ranking approaches that do not make inferences about argument quality can improve rankings generated by an initial semantic search. In our theory, the initial search maximizes topic relevance, and the argument-agnostic re-rankings increase variety, potentially moving premise-conclusion sentence pairs of higher quality up the ranking. We have implemented an argument retrieval system using word embeddings for the initial ranking and three argument-quality-agnostic re-ranking approaches to answer our research question. The re-ranking approaches are based on maximal marginal relevance, the word mover's distance, and a novel distance measure based on fuzzy matching of part-of-speech tags, which we call structural distance. The results show that simple re-ranking approaches can outperform our baseline without re-ranking by a small margin. Our system introduces several parameters: the initial ranking uses a token factor, maximal marginal relevance imposes λ, and structural distance and word mover's distance use a weighting factor μ. For the next iteration of Touché, when relevance and quality judgments on the sentence pair level are available, we plan to perform parameter fine-tuning to improve the approaches outlined here.

References

[1] L. Pummerer, R. Böhm, L. Lilleholt, K. Winter, I. Zettler, K. Sassenberg, Conspiracy Theories and Their Societal Effects During the COVID-19 Pandemic, Social Psychological and Personality Science 13 (2022) 49–59.
[2] A. Bondarenko, M. Fröbe, J. Kiesel, S. Syed, T. Gurcke, M. Beloucif, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2022: Argument Retrieval, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association (CLEF 2022), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2022, p. to appear.
[3] Y. Ajjour, H. Wachsmuth, J. Kiesel, M. Potthast, M. Hagen, B. Stein, Data Acquisition for Argument Search: The args.me Corpus, 2019, pp. 48–59.
[4] H. Wachsmuth, M. Potthast, K. Al-Khatib, Y. Ajjour, J. Puschmann, J. Qu, J. Dorsch, V. Morari, J. Bevendorff, B. Stein, Building an Argument Search Engine for the Web, in: Proceedings of the 4th Workshop on Argument Mining, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 49–59.
[5] C. Stab, J. Daxenberger, C. Stahlhut, T. Miller, B. Schiller, C. Tauchmann, S. Eger, I. Gurevych, ArgumenText: Searching for Arguments in Heterogeneous Sources, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 21–25.
[6] C. Zhai, J. Lafferty, A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval, in: ACM SIGIR Forum, volume 51, ACM, New York, NY, USA, 2017, pp. 268–276.
[7] A. Bondarenko, M. Hagen, M. Potthast, H. Wachsmuth, M. Beloucif, C. Biemann, A. Panchenko, B. Stein, Touché: First Shared Task on Argument Retrieval, in: P. Castells, N. Ferro, J. Jose, J. Magalhães, M. Silva, E. Yilmaz (Eds.), Advances in Information Retrieval. 42nd European Conference on IR Research (ECIR 2020), volume 12036 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2020, pp. 517–523.
[8] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument Retrieval, in: K. Candan, B. Ionescu, L. Goeuriot, H. Müller, A. Joly, M. Maistro, F. Piroi, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. 12th International Conference of the CLEF Association (CLEF 2021), volume 12880 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2021, pp. 450–467.
[9] R. Agarwal, A. Koniaev, R. Schaefer, Exploring Argument Retrieval for Controversial Questions Using Retrieve and Re-rank Pipelines, in: CEUR Workshop Proceedings, 2021, pp. 2285–2291.
[10] K. Ros, C. Edwards, H. Ji, C. X. Zhai, Team Skeletor at Touché 2021: Argument Retrieval and Visualization for Controversial Questions, in: CEUR Workshop Proceedings, 2021, pp. 2441–2454.
[11] C. Akiki, M. Fröbe, M. Hagen, M. Potthast, Learning to Rank Arguments with Feature Selection, in: CEUR Workshop Proceedings, 2021, pp. 2292–2301.
[12] E. Raimondi, M. Alessio, N. Levorato, A Search Engine System for Touché Argument Retrieval Task to Answer Controversial Questions, in: CEUR Workshop Proceedings, 2021, pp. 2423–2440.
[13] A. Mailach, D. Arnold, S. Eysoldt, S. Kleine, Exploring Document Expansion for Argument Retrieval, in: CEUR Workshop Proceedings, 2021, pp. 2417–2422.
[14] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, M. Zhou, MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers, Advances in Neural Information Processing Systems 33 (2020) 5776–5788.
[15] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019, pp. 671–688.
[16] J. Carbonell, J. Stewart, The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries, SIGIR Forum (ACM Special Interest Group on Information Retrieval) (1999).
[17] M. J. Kusner, Y. Sun, N. I. Kolkin, K. Q. Weinberger, From Word Embeddings to Document Distances, in: Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, JMLR.org, 2015, pp. 957–966.
[18] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, arXiv preprint arXiv:1301.3781 (2013).
[19] H. Wachsmuth, B. Stein, Y. Ajjour, "PageRank" for Argument Relevance, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 2017, pp. 1117–1127.
[20] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Technical Report 1999-66, Stanford InfoLab, 1999.
[21] A. A. Hagberg, D. A. Schult, P. J. Swart, Exploring Network Structure, Dynamics, and Function using NetworkX, in: G. Varoquaux, T. Vaught, J. Millman (Eds.), Proceedings of the 7th Python in Science Conference, Pasadena, CA, USA, 2008, pp. 11–15.
[22] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: Information Retrieval Evaluation in a Changing World, Springer, 2019, pp. 123–160.

A. Argument Graphs: Edge Interpretation based on Cosine Similarity

Figure 3: (a) Node degree histogram over all topics. (b) Total count of edges between arguments per topic. (c) Example graphs for five topics.

B. Argument Graphs: Edge Interpretation based on Word Mover's Distance

Figure 4: (a) Node degree histogram over all topics. (b) Total count of edges between arguments per topic. (c) Example graphs for five topics.