Touché Task 2: Comparative Argument Retrieval
A document-based search engine for answering comparative questions
Notebook for the Touché Lab on Argument Retrieval at CLEF 2021

Daniel Helmrich, Denis Streitmatter, Fionn Fuchs and Maximilian Heykeroth
University Leipzig, Germany

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
dh86hogi@studserv.uni-leipzig.de (D. Helmrich); streitmatter@informatik.uni-leipzig.de (D. Streitmatter); ff87bake@studserv.uni-leipzig.de (F. Fuchs); mh40qyqu@studserv.uni-leipzig.de (M. Heykeroth)

Abstract
While the retrieval of simple facts works very well with modern search engines, the quality of results for comparative questions is rather mediocre. With the goal of developing a retrieval algorithm for finding documents containing arguments that help to answer such questions, we evaluated many different approaches to both query expansion and document ranking. Our aim was to build a modular and highly configurable system that provides the necessary tools for evaluation-driven experimentation. The results show that we found approaches that were able to outperform the baseline. Especially simple approaches like query term counting have proven to be promising: one of the simplest approaches, combining the query term count with ChatNoir scores, improves the NDCG by around 5% w.r.t. the baseline. However, due to the limitations of the provided relevance judgments and the limited time frame, further work remains.

Keywords
information retrieval, comparative argument retrieval, comparative questions

1. Introduction

Since its development in 1989, the World Wide Web has evolved into the main source of information for a large part of the world's population. Nearly any fact can be searched for using search engines like DuckDuckGo, Bing, or Google, be it "How long is the Nile?" or "How much nutmeg is dangerous?". The retrieval of such facts works quite well with today's search engines, but the need for information goes far beyond that. The comparison of the wide range of options for things such as products, brands, or lifestyle choices is just as crucial as the retrieval of simple facts. Of course, there are comparison systems available, like diffen.com; but while they provide an appropriate comparison for certain topics, their range is quite limited, and they are not able to answer questions outside of the few domains they offer (e.g., insurance, food, or technology). For users with more specific comparative questions, the remaining option is to consult regular search engines. On those, however, comparative queries like "Which is healthier: Green or Black tea?" or "Which beverage has more calories per glass: beer or cider?" do not perform nearly as well, leaving much room for improvement. In an attempt to tackle this issue, we present in the following paper a modular evaluation system (the source code is available at https://git.informatik.uni-leipzig.de/depressed-spiders/comparative-argument-retrieval) dedicated to assessing various approaches to comparative argument retrieval. The basis for retrieving documents is the ChatNoir search engine [1]. A synopsis of other approaches tackling this issue can be found in the overview paper of the task [2].
The source code was also submitted to TIRA [3], an evaluation platform for information retrieval tasks, to make further automatic evaluation possible.

2. Related Work

Since the same task was already posed at CLEF 2020, various groups have already tried to tackle the issue. The overview paper by Bondarenko et al. [4] summarizes the implemented systems and ideas. Five groups participated in the task, implementing 11 approaches. This provides a good starting point, since it highlights both effective and less successful approaches to follow or avoid. The participants' ideas vary greatly, from different query expansion techniques to various (re-)ranking algorithms that push ChatNoir results containing good arguments to the top. The approach standing out is the one by Abye et al. [5], which yielded the highest NDCG of all runs. They utilize synonyms and antonyms for query expansion and extract important comparison features (e.g., "better") from the query. To retrieve these synonyms, they use the WordNet lexical database and a Word2Vec model trained on the English Wikipedia. Four different queries based on these synonyms and antonyms are created for their query expansion approach. Afterwards, they rerank the retrieved documents using different scores, such as the PageRank or the number of comparative sentences in the results.

An existing system for answering comparative questions in the general domain is CAM (short for comparative argument machine) [6]. The interface allows the input of two targeted objects (e.g., Python and Java) and arbitrarily many further comparison aspects (e.g., faster). The results are sentences from the Common Crawl (http://commoncrawl.org/) retrieved via Elasticsearch [7], which are then reduced to comparative sentences (either by certain keywords or by a machine learning approach) and ranked. The system was evaluated in a user study in which participants used either CAM or a keyword-based search, measuring the speed and confidence of their answers. The results show that the study's participants were faster and more confident using the CAM system. The difference to our task is that this approach finds argumentative sentences in documents, while our task is to find the best argumentative documents.

A tool for detecting arguments in text is TARGER by Chernodub et al. [8]. It is open-source and publicly available, making argument mining accessible to everyone. It provides an algorithm for tagging arguments in text as well as a retrieval functionality for finding arguments for a given topic. The neural tagger lets the user choose between models trained on different datasets and word embedding techniques. It returns a JSON response with a list of words, each tagged as claim, premise, or not part of an argument; additionally, the confidence for each tag is returned.

In an attempt to improve the query expansion performance of argument retrieval systems, Akiki and Potthast [9] investigated three approaches based on transformers, including query expansion with transformer-based models [10] like GPT-2, BERT, and Google's BERT-like universal sentence encoder (USE). The GPT-2-based query expansion is the most successful one, increasing the retrieval performance over the baseline by 6.878%.

By crawling multiple debate portals like Debatewise, Debatepedia, Debate.org, etc. in 2019, Ajjour et al. [11] created the "args.me corpus", which consists of ca. 380,000 arguments from all kinds of debates. These arguments were extracted using specific heuristics.
3. Methodological Approach

3.1. Overview

Figure 1: A schematic overview of the process

To test multiple approaches, we implemented a modular system consisting mainly of a data model and a pipeline that transforms it. Figure 1 shows the (schematic) changes of the data model (light orange squares) and the processing order of the pipeline (blue diamond shapes), structured into different sections (light blue). The data model is initialized with the query, and the first section, Query Expansion, extends it by adding further, related queries in order to find relevant documents. The data model, enriched with multiple queries, is then sent to the ChatNoir [1] endpoint in the ChatNoir Requests section to retrieve identifiers of relevant documents as well as their corresponding full HTML representation. Additionally, the documents are cleaned of HTML tags. At this point, every query (both original and expanded) has a list of result documents. In the next section, Scoring, each of these documents is augmented with arbitrarily many scores, rewarding documents relevant for the comparison with higher scores. In the last step, Reranking, all of these scores are collected and merged into one (allowing a weighting of each score), and the list of result documents is ranked accordingly.

3.2. Query Expansion

To create a broad set of queries, we utilize multiple query expansion approaches and combine them in different ways to maximize recall. We include the original query in all requests.

Our first query expansion method is called en mass. This approach iteratively replaces one of the entities or features in the query with its most similar word. We retrieve similar words with a Word2Vec model provided by the Python library "Gensim" [12]; this model was trained on a Wikipedia corpus from 2014 and on the newswire text corpus "Gigaword" [13]. Entities are the nouns or brand names that the user wants to compare or retrieve more information about. Features are the words that are used to compare the entities under a given aspect. To extract these entities and features, we use the NLP library spaCy [14]. The en mass approach returns as many queries as there are entities and features in the query.

The second approach is called Bilbo Baggins. This is the same query expansion method as described by Abye et al. for the Touché task in 2020 [5].

Our third query expansion utilizes a masked language model and is called masked model expansion. Similar to the approach by Akiki and Potthast [9], we use a masked language model to replace masked words with a similar (context-fitting) word. We take the query and replace its entities and features one by one with a mask token. This method generates as many queries as there are entities and features in the query. The masked model used is "roberta-base" [15]. We chose this model since RoBERTa can be seen as an improved version of BERT [15], which was used by Akiki and Potthast [9]. We also tried to fine-tune a small BERT model on the args.me corpus [16]; however, this fine-tuned model produced significantly worse results, and the approach was discarded due to a lack of time.
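To make the masked model expansion more concrete, the following sketch illustrates the idea with the Hugging Face fill-mask pipeline and roberta-base. The part-of-speech-based selection of entities and features as well as all function names are our own simplifications and assumptions, not the actual implementation.

```python
# Illustrative sketch of the masked model expansion: mask each entity/feature
# (approximated here by a part-of-speech filter) and let roberta-base propose
# replacement candidates. Names and the POS heuristic are assumptions.
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
fill_mask = pipeline("fill-mask", model="roberta-base")

def masked_model_expansion(query, top_k=2):
    doc = nlp(query)
    # Treat nouns/proper nouns as entities and adjectives/adverbs as features.
    targets = [t.text for t in doc if t.pos_ in {"NOUN", "PROPN", "ADJ", "ADV"}]
    expanded = []
    for target in targets:
        masked = query.replace(target, fill_mask.tokenizer.mask_token, 1)
        for prediction in fill_mask(masked, top_k=top_k):
            candidate = prediction["token_str"].strip()
            if candidate.lower() != target.lower():
                expanded.append(query.replace(target, candidate, 1))
    return expanded

print(masked_model_expansion("Which is healthier, green or black tea?"))
```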
Our last approach is the text generation approach, which utilizes a text generation model to provide additional words for a given query. The idea is to generate text from the query and to use the nouns from the generated text as new queries. For this query expansion approach, we work with the GPT-2 model [17]. The text is generated with top-p sampling, which is one of the recommended sampling methods for generating fluent text with GPT-2 models [18]. To generate the text, we use the original query and append the feature word of the original query after the query to start a new sentence. This method is illustrated in Table 1, where the feature word "better" is directly appended after the query; the original query together with this appended word (the first part of the "Generated text" row) serves as the prompt for the GPT-2 model. The idea behind this construction is to constrain the generated text: by including the aspect (feature) of the comparative query in the prompt, the generated sentences should fit the comparative query better.

Table 1: Example of the text generation expansion
Original query:  Which is better, Canon or Nikon?
Generated text:  Which is better, Canon or Nikon? Better pictures are taken by a good camera lens. In comparison [...]
Expanded query:  canon nikon pictures camera lens comparison

All query expansion methods that request similar words for a given word (bilbo baggins, masked model expansion, and en mass) use the two most similar words for a given word (not just the single most similar one). This means that the number of created queries doubles for masked model expansion and en mass. Similarly, the text generation expansion generates two texts and therefore expands the original query into two queries. After the query expansion process, all expanded queries are cleaned by removing words within the queries that are too similar to each other.

3.3. ChatNoir & Cleaning

All queries created during the query expansion are subsequently forwarded to the ChatNoir search engine. In most configurations, 100 documents are requested per query, but as shown later, we also experimented with different values. The ChatNoir queries are executed against the ClueWeb12 dataset, a large collection of English web pages crawled in 2012 [19]. Almost all documents returned by ChatNoir come with a query-specific score, a page rank, and a spam rank. Furthermore, every returned document is uniquely identified by a UUID (universally unique identifier). After our application receives the documents from the ChatNoir endpoint, they are cleaned by removing HTML tags; no further content extraction or cleaning is done.
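As a rough illustration of this retrieval step, the sketch below issues a search request against the public ChatNoir REST API. The endpoint and parameter names follow the ChatNoir API documentation as we understand it, but the exact request layout, the API key, and the helper function are assumptions rather than the authors' code.

```python
# Minimal sketch of a ChatNoir search request against the ClueWeb12 index
# (index shorthand "cw12"). API_KEY and the helper are placeholders; field
# names mirror the attributes mentioned above (uuid, score, page/spam rank).
import requests

API_KEY = "<your ChatNoir API key>"
SEARCH_URL = "https://www.chatnoir.eu/api/v1/_search"

def chatnoir_search(query, size=100):
    response = requests.post(SEARCH_URL, json={
        "apikey": API_KEY,
        "query": query,
        "index": ["cw12"],   # ClueWeb12
        "size": size,
    })
    response.raise_for_status()
    return response.json().get("results", [])

hits = chatnoir_search("Which is healthier, green or black tea?", size=100)
for hit in hits[:3]:
    print(hit.get("uuid"), hit.get("score"), hit.get("page_rank"), hit.get("spam_rank"))
```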
3.4. Scoring

The goal of the scoring steps in our pipeline architecture is to assign scores to documents that reflect their relevance to the query and their argumentative quality. In this section, each approach that we have implemented is described. The resulting scores are used in the Reranking pipeline step (cf. Section 3.5) to sort the documents into their final order of relevance and quality.

3.4.1. Simple Term Count

One of the simplest ranking approaches is to assume that documents containing certain terms might be more relevant. In the context of this work, the main idea is that there might be certain words that are used often in argumentative texts and rarely in non-argumentative texts; simply counting occurrences of these words could therefore help to find relevant documents. Examples of such terms are "evidence", "that shows", "versus", and "in comparison to". The Simple Term Count pipeline step accepts a list of terms as a text file. For each document the pipeline step receives, it checks for every term in the list whether it occurs in the cleaned text of the document. The score assigned to the document is the number of terms from the list that occur in the document's cleaned text.

3.4.2. Query Term Count

Another manual feature-engineering approach we implemented is based on counting relevant query terms in each document. The main idea is that the relevance of a document increases with the number of words that also appear in the query. It can be expected that the user is especially interested in comparisons that include all the terms they searched for. The score therefore rewards documents that contain all of the relevant query terms in an evenly distributed manner. The pipeline step described in the following calculates a document score that reflects these considerations. Its first step is the identification of terms that are deemed relevant for the query. For this, a method similar to the one used for query expansion is employed, which extracts query terms based on their part-of-speech tags. The terms are always extracted from the original query, even if sub-queries were generated by the query expansion step. In the following, $T$ denotes the set of relevant query terms. For each retrieved document, all $t \in T$ are counted, yielding their respective numbers of occurrences $n_t$. Based on this, the score is calculated as follows:

$$
x_{\text{doc}} =
\begin{cases}
0 & \text{if } \exists\, t \in T : n_t = 0 \\
\sum_{t \in T} \left( 1 - \dfrac{1}{b \cdot n_t + 1} \right) & \text{otherwise}
\end{cases}
$$

The score is zero if not every term occurs at least once in the document. Otherwise, the shown sum is calculated, whose summands converge to 1 for increasing $n_t$. This has the effect of not favoring documents merely because some terms occur excessively often, and it makes the score robust to the length of a document. The score is higher for documents in which the query term occurrences are uniformly distributed than for documents in which only one or a few terms occur very often. The factor $b$ (with $b \geq 0$) is a parameter that determines how large the influence of increasing values of $n_t$ is: a greater $b$ yields a curve that is steeper for small $n_t$ and approaches 1 earlier, whereas a smaller $b$ makes the score grow more slowly for increasing $n_t$.
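A minimal sketch of this score is shown below. The part-of-speech-based term extraction is approximated with spaCy, and the lemma-based counting as well as all function names are our own simplifications, not the authors' implementation.

```python
# Sketch of the query term count score from Section 3.4.2. Term extraction is
# approximated with a spaCy POS filter; names and details are assumptions.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def relevant_query_terms(query):
    # Keep nouns/proper nouns (entities) and adjectives/adverbs (features).
    return {t.lemma_.lower() for t in nlp(query)
            if t.pos_ in {"NOUN", "PROPN", "ADJ", "ADV"}}

def query_term_count_score(document_text, terms, b=1.0):
    tokens = [t.lemma_.lower() for t in nlp(document_text)]
    counts = {term: tokens.count(term) for term in terms}
    if any(n == 0 for n in counts.values()):
        return 0.0  # every relevant query term must occur at least once
    # Each summand converges to 1 for growing n_t, so excessive repetition
    # of a single term cannot dominate the score.
    return sum(1.0 - 1.0 / (b * n + 1.0) for n in counts.values())

terms = relevant_query_terms("Which is healthier, green or black tea?")
print(query_term_count_score("Green tea is often considered healthier than black tea.", terms))
```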
3.4.3. Classifier

To measure the argumentative quality of texts, we trained a text classifier and built a pipeline step that assigns scores using TensorFlow text classifiers. The classifier is trained on a combined dataset containing documents from the "askreddit" Q&A forum on the social media platform Reddit and, additionally, documents from the IBM Debater dataset [20]. On the "askreddit" forum, posts can be tagged with the flair serious. These posts are strictly moderated and only serious comments are allowed; jokes, puns, and inappropriate comments are supposed to be deleted. The goal of the classifier during the training phase is to distinguish between posts from two pools. The first pool contains comments from normal askreddit posts. The second pool contains comments from serious askreddit posts and additional arguments from the IBM Debater dataset; this pool should be recognized as arguments of high quality.

The comments and their corresponding question posts are mapped to vectors by calculating the mean of their content's word2vec values. To examine the validity of these resulting vectors, we compared the serious and normal comments with their respective posts. By reducing the vectors with PCA, we could verify that the vectors of serious comments are closer to their corresponding post than those of comments from the normal category. This supports the hypothesis that serious comments are thematically closer to the question they are answering; therefore, a classification based on word vectors can be a valid approach. Nevertheless, due to the limited time frame, we decided to use a classifier that only takes the comment as input, disregarding the original post the comment is attached to. The classifier we trained is a TensorFlow model utilizing text vectorization and one-dimensional convolutional layers. It is trained with the binary cross-entropy loss function. The dataset is split into training and validation data at an 80%/20% ratio. On the validation set, the classifier achieves an accuracy of 83%; state-of-the-art argument classifiers also achieve accuracies of roughly 80% [21]. It has to be noted that the datasets differ substantially, and due to the unique properties of the Reddit data, the trained classifier might operate on a simpler domain than other argument classifiers.

3.4.4. TARGER

For another approach to measuring the "argumentativeness" of a document, the neural argument miner TARGER is used. For this, every document's cleaned text is sent to the publicly available API (https://demo.webis.de/targer-api/apidocs/). It returns a list of words, each tagged as claim, premise, or not part of an argument, together with the confidence of the tag. The assumption behind this approach is: the more of the text is an argument (or part of one), the better. We therefore try to measure the argumentative density by calculating the value $x_{\text{doc}}$:

$$
x_{\text{doc}} = \frac{\mathit{tags}_{\mathrm{arg}}}{\mathit{tags}_{\mathrm{all}}}
$$

where $\mathit{tags}_{\mathrm{arg}}$ is the number of words tagged as (part of) an argument and $\mathit{tags}_{\mathrm{all}}$ denotes all tags (words) in the document. It is configurable which tags count towards $\mathit{tags}_{\mathrm{arg}}$ (claim, premise, or both) and what minimal confidence a tag needs in order to be counted as an argument. Multiple models are available at the public TARGER API. The model mainly used here (the Essays model with fastText embeddings, see https://demo.webis.de/targer-api/apidocs/#/Argument%20tagging/post_targer_api_tag_essays_fasttext) yields argument/non-argument ratios roughly between 10% and 80%.

3.5. Reranking

The reranking step of the pipeline sorts all documents by their assigned scores. Each score value is normalized before sorting, and a configurable weighting of each scoring method is possible. For sorting, each document is assigned a final, normalized, and weighted score using the following formula, where $S$ is the list of a document's scores, $\beta_{\max}$ denotes the maximum value of the respective score over all documents, and $w_\beta$ is the configured weight of the scoring method that produced $\beta$:

$$
\mathit{score}_{\text{doc}} = \sum_{\beta \in S} \frac{\beta}{\beta_{\max}} \times w_\beta
$$
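The following sketch shows one way this normalization and weighted combination could be implemented. The data layout (a per-document dictionary of scores) and all names are illustrative assumptions, not the actual implementation.

```python
# Sketch of the reranking step (Section 3.5): max-normalize each score type
# across all documents and combine the normalized scores as a weighted sum.
def rerank(documents, weights):
    """documents: list of dicts such as
    {'uuid': '...', 'scores': {'query_term_count': 3.2, 'chatnoir': 512.0}}
    weights: {'query_term_count': 2.5, 'chatnoir': 1.0, ...}"""
    # Maximum of each score type over all documents (guard against zero).
    maxima = {
        name: max(doc["scores"].get(name, 0.0) for doc in documents) or 1.0
        for name in weights
    }
    for doc in documents:
        doc["final_score"] = sum(
            (doc["scores"].get(name, 0.0) / maxima[name]) * weight
            for name, weight in weights.items()
        )
    return sorted(documents, key=lambda d: d["final_score"], reverse=True)

ranked = rerank(
    [{"uuid": "a", "scores": {"query_term_count": 2.0, "chatnoir": 400.0}},
     {"uuid": "b", "scores": {"query_term_count": 3.0, "chatnoir": 250.0}}],
    weights={"query_term_count": 2.5, "chatnoir": 1.0},
)
print([doc["uuid"] for doc in ranked])
```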
4. Evaluation & Results

4.1. Evaluation Method

To gain valuable insights into the characteristics of our implemented pipeline steps, we created a baseline configuration to compare them with. It only uses ChatNoir and does not apply any modifications, neither to the original query nor to the ranking that is returned. This also corresponds to the baseline utilized in [4], whose results we could thereby reproduce. Following the idea of evaluation-driven development, our system allows us to test and fine-tune various approaches; by comparing them with the baseline, we could quickly decide which parameters work better than others. The evaluation of every configuration was based on the topics and judged documents from the Touché Lab 2020 [4].

To determine ideal values for the reranking weights, a simple hyperparameter search is used. While, in theory, each part of the pipeline could benefit from this kind of optimization, the reranking step is particularly suited for it: the serialization of the internally used data model allows re-running this step multiple times, and the reranking itself runs in significantly less time than the other pipeline steps. Therefore, a grid search is implemented, which simply iterates over a given set of combinations of the reranking weights. For a given pipeline configuration, this allows determining weight values that maximize the NDCG@5.
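A compact sketch of such a grid search is given below. The helper functions for reranking and for computing NDCG@5 are assumed to exist; the names and the data layout are our own illustrations.

```python
# Sketch of the grid search over reranking weights: try every combination,
# rerank the serialized documents per topic, and keep the combination with
# the highest mean NDCG@5. rerank_fn and ndcg_at_5_fn are assumed helpers.
from itertools import product

def grid_search(topics, weight_grid, rerank_fn, ndcg_at_5_fn):
    """topics: {topic_id: (documents, judgments)};
    weight_grid: {score_name: [candidate weight values]}"""
    best_weights, best_ndcg = None, float("-inf")
    names = list(weight_grid)
    for combination in product(*(weight_grid[name] for name in names)):
        weights = dict(zip(names, combination))
        ndcgs = []
        for documents, judgments in topics.values():
            ranking = rerank_fn(documents, weights)
            ndcgs.append(ndcg_at_5_fn(ranking, judgments))
        mean_ndcg = sum(ndcgs) / len(ndcgs)
        if mean_ndcg > best_ndcg:
            best_weights, best_ndcg = weights, mean_ndcg
    return best_weights, best_ndcg

# Example grid; a weight of 0 switches a scoring method off entirely.
grid = {"query_term_count": [0, 1.0, 2.5], "classifier": [0, 1.1], "targer": [0, 0.5]}
```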
4.2. Results

4.2.1. Query Expansion

An overview of our query expansion methods and their F-score values is shown in Figure 2. The F-scores were obtained by requesting 100 documents per query from ChatNoir.

Figure 2: F-score comparison of query expansion methods

The query expansion mode "0 baseline" in Figure 2 represents the F-score of the default query with 100 requested documents. Since expansion methods like en mass generate up to 8 queries, which results in ca. 800 retrieved documents, we also include "1 baseline_500_docs" and "2 baseline_800_docs", which request 500 and 800 documents, respectively, for the original query. This allows a fairer comparison of our expansion methods with the baseline, since the different query expansion modes create different numbers of queries and therefore retrieve different numbers of documents. We can see that the F-score decreases with more documents. This does not necessarily mean that the results are irrelevant; it could also indicate that the ratio of judged to unjudged documents decreases with more documents. This is why we included the baseline query with more than 100 documents: to better compare the F-score with our query expansion approaches, which also retrieve more than 100 documents (the en mass approach, for example, retrieves ca. 800 documents).

All query expansion methods perform better than the baseline with 500 and 800 documents, but none of them is better than the baseline with 100 documents. This is probably due to the fact that the document judgments were created based on the baseline query. To further increase the number of retrieved documents, and thereby the number of possibly relevant documents, we tried to combine multiple query expansion modes. The combination of all expansion methods with 100 requested documents per query results in the worst F-score of ca. 0.03, which is why we decided to combine only three query expansion approaches. The query expansion approaches with the highest F-score are the combination of bilbo baggins, en mass, and text generation as well as the combination of bilbo baggins, en mass, and masked model expansion; these are shown in Figure 2 as approaches 7 and 10. The main difference between these two combinations is whether the expansion is deterministic: while approach 7 is deterministic, the usage of a GPT-based model makes the generated queries of the other combination nondeterministic.

4.2.2. Reranking

We tested the reranking approaches on the standard baseline with 100 documents (blue bar) from ChatNoir to measure which algorithm would yield the best NDCG. Figure 3 shows the comparison of the different implementations, namely TARGER, the TensorFlow classifier, the simple term counting, and the counting of relevant query terms. The "ES" in the TARGER label stands for the Essays model that was used.

Figure 3: Average NDCG@X of the different reranking approaches (on the baseline)

Even though some topics obtained a better ranking with the use of TARGER, the hyperparameter search revealed that, for maximizing the NDCG, its weight is best set to 0, meaning that it has no positive influence on the average NDCG with any weighting. The classifier based on the IBM/Reddit dataset also seems unable to improve the NDCG of the baseline retrieval. The simple term count does not seem to affect the NDCG much, neither positively nor negatively. A well-performing approach is counting the relevant query terms in each document. Furthermore, the combination of the simple term count and the query term count seems to be slightly better than the latter alone; however, in the result displayed in Figure 3 the simple term count only has a weight of 0.05, meaning that it has nearly no influence.

4.2.3. Combined Approaches

We tried to bring the best of our approaches together by using the query term counting reranking approach on the results of the query expansion, which yielded the results in Figure 4. The bars annotated with _ignore are evaluated with the ignore option. This option ignores all unjudged documents when calculating the NDCG, meaning that unjudged documents are not assumed to be irrelevant, which is normally the case. We did not have many judgments for the documents; most of the documents we retrieved were unjudged. A large number of unjudged documents in the NDCG@5/10/... could either mean that our reranking methods perform badly or that we found many potentially relevant but unjudged documents. We therefore ignore all unjudged documents to calculate the NDCG only over judged documents, which allows us to better compare the NDCG values of the reranking approaches among each other.

The best retrieval method combined with the query expansion (combined_masked_bilbo_gpt) yields no better NDCG than simply querying the baseline. Our second combination approach utilizes the query expansion combined_enmass_bilbo_masked, which has the highest recall. The results of this approach are annotated with reranking scores, and the hyperparameter search is then run on them. It revealed that the scores of TARGER and the simple term count do not contribute to a better NDCG (the best NDCG is obtained with a weight of 0), while the classifier does rank relevant documents higher (weight: 1.1), but not as well as the query term count (weight: 2.5). The precision and the F-score of the approaches with the ignore option are higher than those of the baseline. Since the recall of our query expansion is not as high as that of baseline_800, this means that the query expansion retrieves more unjudged documents.

We decided to submit multiple runs to TIRA, but not all combinations of our approaches, only the ones we deemed worthwhile. The submitted runs are:

run_1: using the combined_masked_bilbo_gpt query expansion and the query term count reranking
run_2: the counts_combined reranking (combining both methods that count terms in the documents)
run_3: using the combined_enmass_bilbo_masked query expansion and an optimized weighting of the reranking

Figure 4: Average metrics over all topics of the combined query expansion with the best reranking approach (qe_best_rr) compared to the baseline (baseline_800) and to the query expansion with the best recall combined with a hyperparameter search on the weighting of all reranking scores (best_recall_hyper_rr)
5. Discussion & Conclusion

In this project, we evaluated many different approaches to both query expansion and document ranking. Our aim was to build a modular and highly configurable system that provides the necessary tools for evaluation-driven experimentation. The results show that we found approaches that were able to outperform the baseline; especially simple approaches like query term counting have proven to be promising. However, there are certain limitations to our work. Due to the focus on many approaches, we could not investigate them in depth; focusing on a smaller number of approaches might help to improve our research insights further. The quality of our evaluation is also limited by the provided judgments we compared against, as they only contain a very limited number of documents per topic.

In the future, multiple steps could be taken to further improve our research. Main content extraction could be applied to the documents and might improve the performance of the scoring pipeline steps by removing ads and irrelevant content. Our TARGER implementation could also be improved by applying a more specific algorithm to the tagged arguments than simply counting words. Additionally, due to the limited number of provided queries, our hyperparameter search might lead to overfitting on the domain; this might need to be mitigated in the future. The query expansion could be improved by working more on the entity and feature extraction, since some expanded queries suffer from falsely extracted entities and features. To further improve the F-score, one could experiment more with the parameters of the specific expansion modes, like the number of returned queries for each approach.

A promising approach to reranking seems to be the counting of relevant query terms in the result document itself, improving the NDCG@5 w.r.t. the baseline by around 5%. It can be assumed that the underlying formula (cf. Section 3.4.2) favors texts that focus on the comparison of two or more entities. However, it does not include any qualitative measurement of the included arguments. This leaves room for further improvement: one could combine the resulting score with another score that rewards good arguments in the text. Lastly, the implemented reranking approach imposes certain limitations, as it is only a linear combination of the calculated document scores; combining them in a more complex way could improve the performance as well.

To conclude, some approaches could be found that improve the retrieval performance in comparison to the baseline. The rather simple approach of counting relevant query terms in each document, combined with ChatNoir scores, seems to be promising. The provided evaluation system might be used in the future to further incorporate and assess the viability of more complex approaches.

References

[1] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl, in: L. Azzopardi, A. Hanbury, G. Pasi, B. Piwowarski (Eds.), Advances in Information Retrieval. 40th European Conference on IR Research (ECIR 2018), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2018.
[2] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument Retrieval, in: D. Hiemstra, M.-F. Moens, J. Mothe, R. Perego, M. Potthast, F. Sebastiani (Eds.), Advances in Information Retrieval. 43rd European Conference on IR Research (ECIR 2021), volume 12036 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2021, pp. 574–582. URL: https://link.springer.com/chapter/10.1007/978-3-030-72240-1_67. doi:10.1007/978-3-030-72240-1_67.
[3] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA integrated research architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF, volume 41 of The Information Retrieval Series, Springer, 2019, pp. 123–160. URL: https://doi.org/10.1007/978-3-030-22948-1_5. doi:10.1007/978-3-030-22948-1_5.
[4] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument Retrieval, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes Papers of the CLEF 2020 Evaluation Labs, volume 2696 of CEUR Workshop Proceedings, 2020. URL: http://ceur-ws.org/Vol-2696/.
[5] T. Abye, T. Sager, A. J. Triebel, An open-domain web search engine for answering comparative questions, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_130.pdf.
[6] M. Schildwächter, A. Bondarenko, J. Zenker, M. Hagen, C. Biemann, A. Panchenko, Answering comparative questions: Better than ten-blue-links?, in: L. Azzopardi, M. Halvey, I. Ruthven, H. Joho, V. Murdock, P. Qvarfordt (Eds.), Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, CHIIR 2019, Glasgow, Scotland, UK, March 10-14, 2019, ACM, 2019, pp. 361–365. URL: https://doi.org/10.1145/3295750.3298916. doi:10.1145/3295750.3298916.
[7] C. Gormley, Z. Tong, Elasticsearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine, O'Reilly Media, Inc., 2015.
[8] A. N. Chernodub, O. Oliynyk, P. Heidenreich, A. Bondarenko, M. Hagen, C. Biemann, A. Panchenko, TARGER: Neural argument mining at your fingertips, in: M. R. Costa-jussà, E. Alfonseca (Eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 3: System Demonstrations, Association for Computational Linguistics, 2019, pp. 195–200. URL: https://doi.org/10.18653/v1/p19-3031. doi:10.18653/v1/p19-3031.
[9] C. Akiki, M. Potthast, Exploring argument retrieval with transformers, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_241.pdf.
[10] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, HuggingFace's Transformers: State-of-the-art natural language processing, CoRR abs/1910.03771 (2019). URL: http://arxiv.org/abs/1910.03771. arXiv:1910.03771.
[11] Y. Ajjour, H. Wachsmuth, J. Kiesel, M. Potthast, M. Hagen, B. Stein, args.me corpus, 2020. URL: https://doi.org/10.5281/zenodo.4139439. doi:10.5281/zenodo.4139439.
[12] R. Řehůřek, P. Sojka, Software Framework for Topic Modelling with Large Corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50. URL: http://is.muni.cz/publication/884893/en.
[13] R. Parker, D. Graff, J. Kong, K. Chen, K. Maeda, English Gigaword Fifth Edition, Philadelphia: Linguistic Data Consortium, 2011. doi:10.35111/wk4f-qt80.
[14] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, 2017. To appear.
[15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[16] H. Wachsmuth, M. Potthast, K. Al-Khatib, Y. Ajjour, J. Puschmann, J. Qu, J. Dorsch, V. Morari, J. Bevendorff, B. Stein, Building an argument search engine for the web, in: Proceedings of the 4th Workshop on Argument Mining, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 49–59. URL: https://www.aclweb.org/anthology/W17-5106. doi:10.18653/v1/W17-5106.
[17] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, 2019.
[18] P. von Platen, How to generate text: Using different decoding methods for language generation with transformers, 2020. URL: https://huggingface.co/blog/how-to-generate, accessed: 26.02.2021.
[19] J. Callan, M. Hoy, C. Yoo, L. Zhao, ClueWeb12 web dataset, 2012.
[20] IBM, IBM Debater dataset, 2014. URL: https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml.
[21] A. Toledo, S. Gretz, E. Cohen-Karlik, R. Friedman, E. Venezian, D. Lahav, M. Jacovi, R. Aharonov, N. Slonim, Automatic argument quality assessment – new datasets and methods, 2019. arXiv:1909.01007.