=Paper=
{{Paper
|id=Vol-3180/paper-263
|storemode=property
|title=Touché - Task 1 - Team Korg: Finding pairs of argumentative sentences using embeddings
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-263.pdf
|volume=Vol-3180
|authors=Cuong Vo Ta,Florian Reiner,Immanuel von Detten,Fabian Stöhr
|dblpUrl=https://dblp.org/rec/conf/clef/TaRDS22
}}
==Touché - Task 1 - Team Korg: Finding pairs of argumentative sentences using embeddings==
Notebook for the Touché Lab on Argument Retrieval at CLEF 2022

Cuong Vo Ta¹ (cv86jahu@studserv.uni-leipzig.de), Florian Reiner¹ (mai12mju@studserv.uni-leipzig.de), Immanuel von Detten¹ (i.vondetten@gmail.com, ORCID 0000-0003-4528-5591) and Fabian Stöhr¹ (fs72popy@studserv.uni-leipzig.de)
¹ Leipzig University, Augustusplatz 10, 04109 Leipzig, Germany, https://www.uni-leipzig.de/
Text Mining and Retrieval Group (TEMIR), Leipzig University, Leipzig, Germany, https://temir.org/

Abstract
This notebook outlines the experiments and results of Team Korg for Task 1 of the Touché Lab on Argument Retrieval at CLEF 2022. ElasticSearch serves as our baseline to index the args.me corpus, extended by preprocessing steps of filtering, stemming, a custom stop-word list, and WordNet-based synonyms to enrich documents. We approach the problem of finding coherent pairs of argumentative sentences by using and comparing two embedding methods for semantic search, namely Doc2Vec and Sentence-BERT. In our first iteration, we use a custom-trained Doc2Vec model and the out-of-the-box semantic search functionality of SBERT. To refine the retrieval of meaningful sentence pairs, we incorporate the text generation functionality of GPT-2 to generate prompts as input for the sentence embeddings. After evaluating these approaches with the Normalized Discounted Cumulative Gain on an annotated dataset from Touché 2021, we identify Doc2Vec without text generation, combined with a revised sentence-pair matching algorithm, as our best performing approach for the retrieval of argumentative sentences.

Keywords
Information retrieval, Argument retrieval, Touché Task 1, CLEF 2022, Semantic search, Doc2Vec, Sentence BERT, GPT-2, Text generation, Transformer

1. Introduction

Today, an ever-increasing pace of news coverage can be observed, while opinions on controversial topics are becoming more polarized. Being exposed to different views is important for forming one's own opinion. While it is possible to search the World Wide Web for virtually every topic, it can be quite difficult to find arguments with relevant sources that portray different points of view. Understanding the reasoning behind two opposing points of view can be facilitated by information retrieval systems. Wachsmuth et al. [1] built an argument search engine which relies on an openly accessible index of nearly 300k web-scraped arguments from different debate portals. The Touché Lab revolves around information retrieval for the arguments of the args.me corpus [2]. This year's Touché Task 1 [3], held at the annual CLEF conference (CLEF 2022: Conference and Labs of the Evaluation Forum, September 5-8, 2022, Bologna, Italy) [4], is about retrieving a pro and a con argument from the corpus and forming a strong argumentative sentence pair for each of them. Our team name is "Korg". In this paper we outline the pipeline for our argument retrieval system and the two different sentence embedding models we use to fulfill the task. Our argument retrieval model is based on the outcome of last year's Touché 2021 [5], more precisely on the results of Team Elrond.
Following their approach, we use ElasticSearch for our index and retrieval with the DirichletLM similarity [6], described in Section 3.1. In order to find a pair of fitting sentences for the retrieved result, we compare two semantic search approaches. In Section 4.2 we identify Doc2Vec [7] and Sentence-BERT (SBERT) [8] as promising sentence embedding models. We find a relevant sentence and match it with another sentence to form a coherent pair by embedding the sentences and computing their cosine similarity. Sentence embeddings are designed to find sentences with the highest similarity, so we have to counteract the problem of retrieving sentences with identical meaning. We therefore develop a second approach in Section 4.3 based on the text generation function of GPT-2 [9]. In this approach, we take the first retrieved sentence and give it to GPT-2 as an opening text (called a prompt). The model then generates a subsequent text output, which is passed to Doc2Vec and SBERT. These models find the sentence in their embeddings most similar to the generated output and pair it with the first sentence. In Section 5 we evaluate our different experiments and decide which pipeline we use for our submission to this year's Touché Lab.

2. Related Work

The system described in this paper is intended to run on the TIRA platform [10, 11]. TIRA is an online platform where researchers can upload and evaluate their proposed solutions for shared tasks. Shared tasks are an increasingly popular form of solving open problems in academia. They are often conducted in the scope of conferences, especially in the fields of natural language processing and machine learning (for examples of shared tasks at international conferences, see https://www.statmt.org/wmt21/, https://pan.webis.de/clef19/pan19-web/ and https://alt.qcri.org/semeval2019/index.php?id=tasks). The organizers usually contribute large datasets which the participating teams of researchers can use to tackle diverse problems. For the shared task of the Touché Lab, TIRA ensures that all solutions run on the same data and produce reproducible results.

With web search engines it can be harder to get an overview of different opinions on a topic than to get the answer to a clearly stated question [12]. The field of computational argumentation deals with this and other problems by mining arguments from unstructured text and representing arguments computationally, which is essential for improving search results involving arguments [13, 14]. There are also approaches to create a World Wide Argument Web based on structured arguments in an Argument Interchange Format (AIF) [15].

In this paper we focus on the results of previous teams participating in the Touché Lab, especially in Task 1 of last year's shared task [5]. That task was to retrieve arguments for "controversial questions from a focused collection of debates to support opinion formation on topics of social importance" [5]. This year's task extends this and additionally asks the participating teams to retrieve a pair of argumentative sentences for the positive and negative stances of an argument. Solutions from Touché 2021 covered a wide range of approaches and performance levels. For argument retrieval, the model based on DirichletLM proved to perform better than other models [16]. Multiple teams evaluated the impact of preprocessing by means of query expansion, word stemming, and WordNet-based synonyms.
Several teams chose semantic search as a method to retrieve relevant results. Preprocessing proved to be an important step to increase the relevance and quality of the search results. Semantic search delivered mixed results, which were influenced by the choice of model and training data.

To compare sentences from arguments for the sentence pair retrieval, two technologies are used in this paper: Sentence BERT (SBERT) [8] and Doc2Vec [7]. SBERT is a modification of the BERT network, a neural network designed for natural language processing (NLP) tasks like language modeling and next sentence prediction [17]. SBERT performs better on sentence-pair regression tasks and can, for example, be used to find similar sentence pairs in a huge collection of sentences. Doc2Vec is based on Word2Vec, a neural network with word embeddings in a vector space, where vectors represent words. The embedding is designed to solve problems like calculating the similarity of words, usually by measuring the cosine similarity of the respective vectors [18, 19]. Doc2Vec extends the vector representation from words to documents of arbitrary size. The generative pre-trained transformer (GPT-2) is a language model developed by OpenAI (https://openai.com/) which uses a deep neural network to perform several text-related tasks like translation or text summarization [9]. In this paper GPT-2 is used to generate follow-up sentences based on sentences from arguments.

3. Methodological Approach

In the following section, the methods we use are described briefly. To provide context, we introduce our pipeline and present the concepts of semantic search and sentence embeddings. We rely on these concepts to match sentences which should represent an argument in its entirety and form a coherent pair at the same time. Then we cover the technologies we use in detail: ElasticSearch, Doc2Vec, SBERT and GPT-2.

3.1. Indexing

We use ElasticSearch to create an index of all arguments from the args.me corpus. In order to improve the out-of-the-box functionality of ElasticSearch, we implement a preprocessing pipeline before building the index. For optimal results, we follow the approach of Team Elrond from Touché 2021, who performed well in both the relevance and the quality scores and placed among the best in both categories. They used a combination of several preprocessing steps: the Asciifolding and Lowercase filters, and stemming with the Krovetz algorithm [5, 20]. We add an additional step and remove stop words with a custom stop-word list. As the last step in our pipeline, we follow the approach of Team Elrond again and enrich our documents with WordNet-based synonyms [21, 5].

3.2. Retrieval

To retrieve arguments based on the given query, we also use ElasticSearch. We use the LM Dirichlet similarity to score matching documents, because previous Touché contributions that used ElasticSearch reported the best results with it [22, 5, 23]. Figure 1 provides an overview of the pipeline of our retrieval system up to this point: the args.me corpus serves as the basis for indexing, and we search for relevant arguments using ElasticSearch. We use the top results as a starting point to find sentences which form a coherent pair. For this, we experiment with different methods, which we introduce in the following sections. The sketch below illustrates how such an index configuration could look.
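A minimal sketch, assuming the elasticsearch Python client (7.x-style API): the stop-word list and synonym entries are illustrative placeholders rather than our actual lists, and the field names and the mu value are assumptions. Elasticsearch's built-in kstem token filter implements the Krovetz stemmer.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

index_config = {
    "settings": {
        "analysis": {
            "filter": {
                # placeholder stop words; our actual custom list is larger
                "custom_stop": {"type": "stop", "stopwords": ["the", "a", "an", "of"]},
                # placeholder synonyms; in our pipeline these are WordNet-based
                "wordnet_synonyms": {"type": "synonym", "synonyms": ["car, automobile"]},
            },
            "analyzer": {
                "argument_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # asciifolding and lowercase filters, Krovetz stemming (kstem),
                    # custom stop-word removal and synonym enrichment
                    "filter": ["asciifolding", "lowercase", "kstem",
                               "custom_stop", "wordnet_synonyms"],
                }
            },
        },
        # score matching documents with the LM Dirichlet similarity
        "similarity": {"default": {"type": "LMDirichlet", "mu": 2000}},
    },
    "mappings": {
        "properties": {
            "conclusion": {"type": "text", "analyzer": "argument_analyzer"},
            "premises_text": {"type": "text", "analyzer": "argument_analyzer"},
        }
    },
}

es.indices.create(index="args-me", body=index_config)
```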
Figure 1: The setup of the retrieval system for retrieving relevant arguments, up to the point where an initial sentence can be used to find a pair of argumentative sentences.

3.3. Semantic Search

Semantic search describes the process of finding meaning in a text. This meaning can refer to different parts of the search process, like understanding the query or the data, or representing knowledge in a way suitable for retrieval [24]. So unlike more traditional search engines, which find documents based on lexical matches, semantic search can also find synonyms [25].

Figure 2: A schematic illustration of a vector space in two dimensions. The query of a search, marked as the orange dot, is embedded, and relevant documents lie closer to it than irrelevant documents. Figure adapted from [25].

The basic idea is that known data is embedded into a vector space, as seen in Figure 2: a text of arbitrary length is represented as a point in the vector space. In the same way, a search query can be embedded into that vector space, and a vector can be inferred for the new data. It is possible to find the closest entry by calculating the Euclidean distance between arbitrary vectors. This entry should then have a high semantic overlap with the query [25]. We use semantic search to compare the sentences we want to pair, with the goal of finding the most similar sentences. Next, we describe the two methods we use to retrieve sentence pairs: Doc2Vec and Sentence BERT. Both are based on the aforementioned concept of embedding documents of different lengths, in our case sentences, into a vector space and finding documents which are similar in their semantics.

3.3.1. Doc2Vec

Doc2Vec is an unsupervised machine learning model that represents documents by a dense vector. It was first introduced in 2014 by Le and Mikolov and builds upon Word2Vec [26, 7]. Word2Vec assumes that words can be embedded into a vector space and the resulting vectors can be used to measure semantic similarity. This understanding of semantic similarity or meaning is based on the bag-of-words model. Doc2Vec aims to overcome the major weaknesses of bag-of-words models: assuming a sentence can be represented by its bag of words implies that the meaning is independent of the word order. Therefore, when using Word2Vec, the word ordering within the respective document is lost. Additionally, the context and semantics of the words are usually ignored, and semantic features of sentences like negation or irony cannot be taken into account [26]. To retrieve argumentative sentences for positive and negative stances, this context needs to be represented. Using Word2Vec leads to a scenario where different sentences can have the same representation as long as they contain the same words. To solve this issue and take the context of the sentence as well as the word order into account, Le and Mikolov add a new vector to a document of variable length. The document vector serves as information about the context of a word when using the embedding. It is combined with the respective word vectors of the document by averaging or concatenating. The paragraph id, as seen in Figure 3, serves as a representation of the context of the paragraph or document, respectively.

Figure 3: Doc2Vec's framework for learning paragraph vectors. The paragraph id represents the context of a paragraph and is used as additional information to calculate embeddings for single words. Figure adapted from [26].

This document vector D is then used during the training of the word vectors W and holds the document representation. The authors report improvements for information retrieval and classification over other methods based on word embeddings and bigram embeddings alone. We choose Doc2Vec to retrieve coherent sentence pairs because it represents the meaning of documents of variable length, in our case sentences, and takes the sentence as a whole into account instead of every word independent of its context. Therefore the similarity score produced by Doc2Vec promises to yield sentences which are similar in overall meaning rather than in word-level semantics only.
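A minimal sketch of how such a model can be trained and queried with gensim, treating each corpus sentence as one document in the sense of Doc2Vec; the example sentences and hyperparameters are illustrative placeholders, not our tuned values.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# placeholder corpus; in our pipeline every sentence of args.me becomes one document
sentences = [
    "sex education should be taught in schools",
    "students benefit from a comprehensive curriculum",
    "parents should decide what their children learn",
]
corpus = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(sentences)]

# illustrative hyperparameters
model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=20)

# infer a vector for an unseen conclusion and find the most similar sentences
vector = model.infer_vector("schools need more sex education".split())
print(model.dv.most_similar([vector], topn=2))
```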
3.3.2. Sentence BERT

BERT is a language representation model that was first introduced in 2018. To create this language model, BERT uses deep bidirectional representations learned from unlabeled text. Bidirectional indicates that the model predicts a word based on the words that precede and follow it. With this approach, BERT achieved state-of-the-art results on several tasks, including question answering and language inference [8]. For semantic similarity search, on the other hand, BERT is not well suited, due to an architecture that comes with a high computational overhead. To overcome this bottleneck, Sentence-BERT (SBERT) was introduced in 2019 [25]. SBERT is a modification of the existing pretrained BERT network which allows us "to derive semantically meaningful sentence embeddings" [25]. This is achieved by adding a pooling operation to the output of BERT, resulting in a fixed-size sentence embedding. Furthermore, siamese and triplet networks are used to fine-tune BERT and ensure that the produced embeddings are "semantically meaningful and can be compared with cosine-similarity" [27, 25]. SBERT is easily accessible through a Python framework based on PyTorch and Transformers [28, 29]. The framework allows the use of different pretrained models to compute sentence embeddings in over 100 languages [25]. The model we use (all-MiniLM-L12-v2) is based on Microsoft's MiniLM and fine-tuned on more than 1 billion sentence pairs [30, 25]. The authors describe it as a general-purpose model, which is one reason why we do not fine-tune it any further for our sentence embeddings.
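A minimal sketch of the out-of-the-box semantic search we rely on, using the sentence-transformers framework and the all-MiniLM-L12-v2 model named above; the example sentences are placeholders.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L12-v2")

# placeholder candidate sentences; in our pipeline these come from args.me
sentences = [
    "Sex education reduces teenage pregnancies.",
    "Schools should not replace parents.",
    "Comprehensive curricula inform students about health.",
]
corpus_embeddings = model.encode(sentences, convert_to_tensor=True)

# embed a conclusion and retrieve the most similar corpus sentences
query_embedding = model.encode("We need more sex education in schools.",
                               convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)
for hit in hits[0]:
    print(sentences[hit["corpus_id"]], hit["score"])
```

util.semantic_search ranks the candidates by cosine similarity, which is exactly the comparison SBERT's embeddings are fine-tuned for.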
3.4. GPT-2

GPT-2 is the abbreviation of Generative Pre-trained Transformer 2, an unsupervised transformer language model created by OpenAI in 2019. It is mostly used to translate texts, answer questions, summarize passages, and generate text output [31]. To gather training data for the GPT-2 model, OpenAI scraped 45 million outbound links from the social media platform Reddit; to assess the quality of a shared URL, they only used links with at least 3 upvotes from the community. The final corpus consists of slightly over 8 million documents with a total of 40 GB of text [9]. The architecture of GPT-2 is based on the Transformer [29]. OpenAI released four model sizes of GPT-2 for free use, which are compared in Table 1.

Name         Parameters   Layers   d_model
SMALL        117M         12       768
MEDIUM       345M         24       1024
LARGE        762M         36       1280
EXTRA LARGE  1542M        48       1600

Table 1: Architecture hyperparameters for the four publicly available model sizes of GPT-2 [9].

We use GPT-2 to generate follow-up sentences for the initially retrieved sentences from the corpus, with the goal of finding more coherent sentence pairs. This leads to the pipeline displayed in Figure 4, where we use the generated sentence from GPT-2 as input to retrieve a similar sentence with Doc2Vec and SBERT.

Figure 4: Continuation of the pipeline, including text generation to find a fitting second argument.

4. Experiments

In the following sections we describe our approach in more detail and explain our experiments. We outline the pipeline for our retrieval system as well as the setup and preprocessing steps. This includes a detailed description of the data we rely on and our steps to create an index. ElasticSearch serves as the baseline for indexing and retrieval of arguments. We then describe the development of our pipeline with Doc2Vec, SBERT and text generation with GPT-2 from a prototype to a more refined system for retrieving argumentative sentences. In the next section we evaluate the results of the different approaches.

4.1. Setup: Dataset, Indexing and Argument Retrieval

For our experiments, we use the args.me corpus [2]. This corpus consists of 387,606 arguments. Each argument consists of an id, a conclusion, premises (modeled as stance and text), a context, and sentences (modeled as sourceTitle, sourceId, nextArgumentInSourceId, sourceUrl, discussionTitle, previousArgumentInSourceId, acquisitionTime). The arguments were crawled from four platforms: Debatewise, IDebate.org, Debatepedia and Debate.org. As described in Sections 3.1 and 3.2, we use ElasticSearch for indexing and argument retrieval, with the configuration used by Team Elrond in the previous Touché Task 1 [5]. With this setup we can query ElasticSearch to retrieve relevant arguments. The sketch below illustrates how these corpus fields can be read.
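A sketch under stated assumptions: we assume the args.me JSON export with a top-level "arguments" list, which may differ between corpus versions; the file name is a placeholder.

```python
import json

# assumed layout of the args.me JSON export: {"arguments": [...]};
# the field names follow the corpus structure described above
with open("args-me.json", encoding="utf-8") as f:
    arguments = json.load(f)["arguments"]

for arg in arguments:
    arg_id = arg["id"]
    conclusion = arg["conclusion"]
    for premise in arg["premises"]:
        stance, text = premise["stance"], premise["text"]
        # here the id, conclusion, stance and premise text would be indexed
        print(arg_id, stance, conclusion[:40], text[:40])
```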
Next, we present our approaches to pair fitting sentences.

4.2. Approach 1: Semantic Search using Doc2Vec and a Matching Algorithm

In this approach, the conclusion and each sentence of a retrieved argument are embedded by Doc2Vec. The Doc2Vec model is trained on the args.me corpus [2]: we treat one sentence as one document in the sense of Doc2Vec and train the model on all sentences of the corpus. We preprocess the sentences with two filters. First, we remove all sentences with fewer than three words, because they carry little to no meaning. Then, we remove all sentences containing a hyperlink, because in most cases these sentences refer to a source supporting the argument but carry no meaning in themselves.

For matching two sentences, we first experiment with a naive approach that looks for the sentence most similar to the conclusion in the whole corpus. The conclusion is inferred as a vector in our Doc2Vec model, and the most similar sentence in the corpus is located by computing the cosine similarity with the inferred vector. The quality of the matches varies immensely, and the two sentences support each other only in a few cases: because sentences are retrieved from the whole corpus, they are often superficially similar but cover very different topics. Therefore we only look for matching sentences within the sentences of one argument. To get the most meaningful sentences and to ensure the sentences form a coherent pair, we develop the following matching algorithm: all sentences of an argument are matched with each other, and the cosine similarity of the respective vectors is calculated. Then we calculate the cosine similarity of each sentence to the conclusion. Our reasoning for comparing each sentence to the conclusion is the following: we assume the conclusion carries the summarized meaning of the argument, so we want to retrieve sentences which are most similar to the conclusion and thus carry the most meaning or the most important part of the argument, respectively. In a last step, we average the similarity between the two sentences and the similarity of each of the two sentences to the conclusion. The sentence pair with the highest averaged cosine similarity then forms our top argumentative sentence pair.
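A condensed sketch of this matching step, assuming the sentences of one argument and its conclusion have already been embedded (e.g., by the Doc2Vec model above); the helper names are ours, and averaging all three similarities is one reading of the averaging step described above.

```python
from itertools import combinations

import numpy as np


def cosine(a, b):
    """Cosine similarity of two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def best_sentence_pair(sentence_vectors, conclusion_vector):
    """Return the index pair whose mutual similarity, averaged with both
    sentences' similarity to the conclusion, is highest."""
    best_pair, best_score = None, -1.0
    for i, j in combinations(range(len(sentence_vectors)), 2):
        score = (cosine(sentence_vectors[i], sentence_vectors[j])
                 + cosine(sentence_vectors[i], conclusion_vector)
                 + cosine(sentence_vectors[j], conclusion_vector)) / 3
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score
```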
4.3. Approach 2: Text Generation

As an alternative approach, we use GPT-2's text generation functionality to find a second sentence. We use a sentence from the ElasticSearch results as input for GPT-2, which generates a subsequent text output. This output is used as input for semantic search in our Doc2Vec and SBERT models. In the end, the sentence most similar to the generated output is matched with the initial sentence to form a sentence pair. The idea is that GPT-2 can generate a sentence which is a coherent follow-up to the first sentence, and that our two models can then find the sentence most similar to the generated one.

Just using the first argument as an opening text (referred to as a prompt) leads to mixed quality of the generated texts. In order to stabilize the quality, we follow the approach of Akiki and Potthast from the Touché Lab 2020 [23], who used text generation for query expansion to achieve better retrieval results. By embedding the query in an argumentative structure, as seen in Table 2, they steered the language model towards output which resembles an opinion more closely than a simple statement [32]. To represent arguments for the same topic but with opposing stances, they created positive, negative and neutral prompts, as shown in Table 2. This stabilizes the text generation and leads to results which carry a stronger pro or con sentiment. Using the same prompts as Akiki and Potthast, we insert the conclusion of the respective argument into the argumentative structure outlined in Table 2 and use the modified prompt as input for GPT-2.

Stance    Prompts
Positive  What do you think? ... Yes because ...
          What do you think? ... The answer is yes ...
Negative  What do you think? ... No because ...
          What do you think? ... The answer is no ...
Neutral   What do you think? ... I don't know ...
          What do you think? ... Not sure ...

Table 2: Prompts for different stances to get a variety of argumentative output. The pattern of a question, the argument, and an affirmative or dissenting beginning of a sentence leads to texts which are closer to opinions in arguments than to factual statements.

Furthermore, the choice of the decoding method of GPT-2, which turns the model's output distribution into text, and its respective parameters are important for the quality of the generated output. These methods determine how incoherent, repetitive or generic the generated text is by deciding which token may follow another token [33]. For our purpose we use the following three sampling methods for decoding, as they provided the best output:

Temperature sampling: the probability distribution is reshaped through a temperature parameter t > 0 [34], which steers the distribution towards high-probability tokens (t < 1) or towards a flatter distribution (t > 1). A low value improves the generation quality but decreases the diversity of the output; a high value regularizes the generation by making the model less certain of its top choices [32, 35]. We use a temperature of 1.4.

Top-k sampling: the language model distribution is truncated to the set of the k most likely tokens, and the probability mass is redistributed among those k next tokens [35, 36]. We choose k = 75.

Nucleus sampling: a stochastic decoding method which, like top-k sampling, truncates the language model distribution. It chooses from the smallest possible set of tokens whose cumulative probability exceeds the probability p; the remaining tokens are discarded. It is also called top-p sampling [35]. We set the probability to p = 0.6.

With these three decoding strategies and the six prompts, we generate 18 text outputs with a length of 75 tokens for each passed argument. In order to figure out which generated output fits the passed argument best, we compute the cosine similarity of each output with the first argument. Some cosine similarity scores are quite high because GPT-2 merely repeated the passed argument in its output (Table 3). Therefore we set a threshold to remove outputs which are too similar to the prompt. The best of the remaining generated outputs is passed to Doc2Vec and SBERT for semantic search.

Score   Generated output
0.8109  What do you think? We need more sex education in schools. No because that's what the kids do. We need more sex education in schools. Do you agree with the idea that we should have sex education in schools? No. We should have sex education in schools. Do you agree with the idea that we should have sex education in schools?
0.7838  What do you think? We need more sex education in schools. I don't know what to say. I think that sex education is just not enough. If you don't get it right, it's not going to work. I think that if we are really concerned about the sex education system, we need to have a national sex education program, where all ...
0.7396  What do you think? We need more sex education in schools. Not sure ive seen any evidence that sex education is helping to reduce sexual violence.

Table 3: An example of the top three out of 18 generated outputs, ordered by their cosine similarity to the query "sex education in schools". The expanded query forms the beginning of each output; everything following it is generated by GPT-2.

With this approach we encounter the problem that a conclusion is always part of the sentence pair, and some conclusions are quite short and therefore carry little meaning. To counteract this, we implement the following steps in our pipeline:

Doc2Vec: The results of ElasticSearch, the arguments most relevant to a query, are passed on to Doc2Vec. For each sentence of the argument, GPT-2 generates a subsequent text output. Afterwards we compute the cosine similarity of each text output with each sentence of the respective argument and use the most similar sentence pair.

SBERT: The conclusion of the best argument retrieved by ElasticSearch is used as input for SBERT. The language model embeds the conclusion and retrieves the five sentences with the highest semantic similarity. For each of these sentences, GPT-2 generates a subsequent text output. These text outputs are again passed on to SBERT, which searches for similar sentences. The results of SBERT are then matched with the first argument, and the most similar sentence pair forms the desired argumentative sentence pair.
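A sketch of the generation step with the Hugging Face transformers library, using the decoding parameters reported above; running the three strategies separately in one loop is our simplification, and the prompt shown is one of the stance templates from Table 2 filled with an example conclusion.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "What do you think? We need more sex education in schools. No because"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# one configuration per decoding strategy, parameters as reported above;
# top_k=0 disables the default top-k truncation where it is not wanted
decoding_configs = [
    {"temperature": 1.4, "top_k": 0},  # temperature sampling
    {"top_k": 75},                     # top-k sampling
    {"top_p": 0.6, "top_k": 0},        # nucleus (top-p) sampling
]

for config in decoding_configs:
    output = model.generate(input_ids, do_sample=True, max_new_tokens=75,
                            pad_token_id=tokenizer.eos_token_id, **config)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```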
5. Evaluation and Discussion

We use a two-step approach for our own evaluation of the results of our retrieval system. In the first step we evaluate the results of the argument retrieval; in the second step we evaluate the sentence pairs generated from those results. This ensures that the sentence pairs are generated from a well-evaluated baseline of argument retrieval. We use the mean Normalized Discounted Cumulative Gain (NDCG) to measure the quality of both steps over several topics [37]. This measure was also used to compare the different approaches in the Touché 2021 argument retrieval tasks [5], so we can compare our argument retrieval results to the previous results of Touché 2021. We also evaluate the system in the context of this year's Touché Lab; the results from the TIRA platform are discussed at the end of this section.
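For reference, a sketch of how such NDCG scores can be computed with scikit-learn; the relevance grades and system scores are placeholder values, and spam grades (-2) would need to be mapped to 0 first, since ndcg_score expects non-negative relevance.

```python
import numpy as np
from sklearn.metrics import ndcg_score

# placeholder relevance grades (0-3) for one topic's ten results, in the
# order our system ranked them, and the system's descending retrieval scores
true_relevance = np.asarray([[3, 2, 3, 0, 1, 2, 0, 0, 1, 0]])
system_scores = np.asarray([[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]])

print("NDCG   :", ndcg_score(true_relevance, system_scores))
print("NDCG@5 :", ndcg_score(true_relevance, system_scores, k=5))
```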
5.1. Argument Retrieval Evaluation

For the evaluation of the argument retrieval results, we use a dataset with graded results for 50 example topics, which was also used for the relevance judgments of Touché 2021 Task 1 [38]. Each entry in this dataset contains a topic id, an argument id, and a grade which indicates how relevant the argument is for the topic. The grades take the value -2 for non-arguments (spam) and otherwise range from 0 (not relevant) to 3 (highly relevant).

Table 4 displays the results of different approaches to argument retrieval. The baseline retrieval uses the fields conclusion, premises.text and sentences.sent_text from the argument corpus, with a weight of one for each field and LMDirichlet as the retrieval model [6]. For the optimized baseline approach, the weights of the retrieval fields are optimized (in a range from zero to three) with a Bayesian optimizer [39]; the optimization objective is the mean NDCG value described above. The expanded index approach is the one described in Section 3.1. In the optimized expanded index approach, the weights of the search fields are again optimized with a Bayesian optimizer, and in the heavily optimized expanded approach the optimization runs for more iterations, but, as shown in Table 4, the NDCG values stay the same. Compared to the Touché 2021 Task 1 approaches with the best relevance scores (Elrond with an NDCG@5 of 0.720 and Pippin Took with an NDCG@5 of 0.705), the optimized expanded index approach scores slightly better.

Method                             Mean NDCG   Mean NDCG@5
Baseline                           0.8385      0.6754
Baseline, optimized                0.8559      0.7116
Expanded index                     0.8506      0.7166
Expanded index, optimized          0.8624      0.7417
Expanded index, heavily optimized  0.8624      0.7417

Table 4: The mean NDCG and NDCG@5 values for different approaches to argument retrieval.

5.2. Sentence Pair Evaluation

For the evaluation of the retrieved sentence pairs, the evaluation data has to be created manually. We therefore create a dataset similar to the one used for the argument retrieval evaluation, based on the retrieved sentence pairs for a topic. The results are reviewed by hand and graded by their relevance to the topic and by how well the sentences fit together argumentatively. To keep the effort within an acceptable range, the evaluation covers the top ten results for each of ten topics. As shown in Table 5, the Doc2Vec method without text generation from GPT-2 returns the best results for the sentence pair retrieval. The baseline approach uses random pairs of sentences from the first ten results. It is striking that the BERT approach performs even worse than the random pairs approach.

Method                           Mean NDCG   Mean NDCG@5
Random pairs                     0.1775      0.1121
Doc2Vec                          0.5572      0.3644
Doc2Vec + GPT-2 text generation  0.3235      0.2283
BERT + GPT-2 text generation     0.05        0.05

Table 5: The mean NDCG and NDCG@5 values for different approaches to sentence pair retrieval.

5.3. Final Evaluation on TIRA

The previous sections described the results of our own evaluation. The retrieval of sentence pairs is also evaluated in the context of Touché 2022 [3]. For this, we are provided with the TIRA platform [11], on which researchers can submit and evaluate their solutions for shared tasks. We choose the configuration which performed best in our own evaluation, that is, Doc2Vec without GPT-2 text generation. The evaluation results for our submitted run, named korg9000, are shown in Table 6.

Category              Mean NDCG@5   CI95 Low   CI95 High
Relevance evaluation  0.252         0.187      0.318
Quality evaluation    0.453         0.384      0.529
Coherence evaluation  0.168         0.117      0.223

Table 6: The evaluation results for the run on the TIRA platform, by category.

Our retrieval system performed poorly both in comparison to our own evaluation and to the results of the other participants of the shared task. The mean NDCG@5 for relevance is 0.252, compared to the best performing run from Team Porthos with a mean NDCG@5 of 0.742. The mean NDCG@5 for quality is 0.453; the best performing team in this category is Daario Naharis with a mean NDCG@5 of 0.913. For the coherence evaluation, our run scores a mean NDCG@5 of 0.168, while the best performing team, again Daario Naharis, reaches 0.458.

5.4. Discussion of the Final Results

The aim of the experiments outlined in this paper is to compare different types of semantic search. Within our own evaluation, we were able to compare the sentence pairs retrieved by Doc2Vec and SBERT. The results were mixed in both cases. This can be attributed to the differences between the two methods: while it was easy to train the Doc2Vec model ourselves, the model was designed with paragraphs in mind, not short sentences. For SBERT, customizing the training was outside the scope of this notebook, which could have led to the poor results. We use the same pipeline to generate the input for the semantic search, but we could have benefited from a step-wise evaluation of each stage of the pipeline. Furthermore, more runs on the TIRA platform with different configurations would have been helpful to test our approaches; the strengths and weaknesses of the respective approaches could then have been better analysed and combined.

6. Conclusion

To solve Touché Task 1, we developed a system to find pairs of argumentative sentences for positive and negative stances towards an argument. We worked with the args.me corpus and two different approaches, Doc2Vec and Sentence BERT, and combined both of them with GPT-2, finally re-embedding the resulting sentences. For the retrieval of arguments we use ElasticSearch. In the end we compare and evaluate our results with the NDCG and mean NDCG. We show that differences in how the sentences are paired have a great impact on the overall results of the retrieval system. The Doc2Vec approach delivers acceptable results after reworking the sentence pairing algorithm to only use sentences from the respective argument.
Furthermore, using text generation to improve the input for the sentence embedding models does not provide better results by default. On the contrary, the results for Doc2Vec are better without the additional step of generating a more elaborate prompt. The quality of the sentence pairs depends on the quality of the sentences generated by GPT-2; using GPT-2 in this context needs more fine-tuning to reduce the randomness in the generated texts. This could be a starting point for further improvements, as it is possible to use GPT-2 to generate more than one sentence and check whether one of those sentences fits better. This is contrasted by the negative impact of the sentence generation on performance, which slows down the entire pipeline. Doc2Vec without additional text generation delivers the best performance in the context of our own evaluation. Unfortunately, Sentence BERT performs poorly in our experiments: the SBERT approach in combination with GPT-2 text generation has a significantly lower mean NDCG and mean NDCG@5 score in the evaluation than the baseline approach with random sentence pairs. The poor results could stem from not fine-tuning SBERT further or from not choosing the right model for this task. The algorithm to match the two desired sentences needs to be revised to produce better results. There is more room for further improvements of our pipeline, like additional preprocessing steps, using part-of-speech tags, or trying to find a subject-object relationship between argumentative sentence pairs to get a better understanding of the context of a sentence.

Acknowledgments

Thanks to Theresa Elstner for the extensive feedback on the draft of this notebook, and to the TEMIR team (https://temir.org/people.html) for supporting us with the upload of our software to the TIRA platform (https://tira.io/) to evaluate our retrieval system.

References

[1] H. Wachsmuth, M. Potthast, K. Al-Khatib, Y. Ajjour, J. Puschmann, J. Qu, J. Dorsch, V. Morari, J. Bevendorff, B. Stein, Building an Argument Search Engine for the Web, in: K. Ashley, C. Cardie, N. Green, I. Gurevych, I. Habernal, D. Litman, G. Petasis, C. Reed, N. Slonim, V. Walker (Eds.), 4th Workshop on Argument Mining (ArgMining 2017) at EMNLP, Association for Computational Linguistics, 2017, pp. 49-59. URL: https://www.aclweb.org/anthology/W17-5106.
[2] Y. Ajjour, H. Wachsmuth, J. Kiesel, M. Potthast, M. Hagen, B. Stein, Data Acquisition for Argument Search: The args.me Corpus, in: C. Benzmüller, H. Stuckenschmidt (Eds.), 42nd German Conference on Artificial Intelligence (KI 2019), Springer, Berlin Heidelberg New York, 2019, pp. 48-59. doi:10.1007/978-3-030-30179-8_4.
[3] A. Bondarenko, M. Fröbe, J. Kiesel, S. Syed, T. Gurcke, M. Beloucif, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2022: Argument Retrieval, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association (CLEF 2022), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2022, to appear.
[4] CLEF Initiative, 2010. URL: http://www.clef-initiative.eu/, accessed: 2022-02-27.
[5] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument Retrieval, in: K. Candan, B. Ionescu, L. Goeuriot, H. Müller, A. Joly, M. Maistro, F. Piroi, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. 12th International Conference of the CLEF Association (CLEF 2021), volume 12880 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2021, pp. 450-467. URL: https://link.springer.com/chapter/10.1007/978-3-030-85251-1_28. doi:10.1007/978-3-030-85251-1_28.
[6] C. Zhai, J. Lafferty, A study of smoothing methods for language models applied to ad hoc information retrieval, in: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, Association for Computing Machinery, New York, NY, USA, 2001, pp. 334-342. URL: https://doi.org/10.1145/383952.384019. doi:10.1145/383952.384019.
[7] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013. arXiv:1301.3781.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[9] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, 2019.
[10] T. Gollub, B. Stein, S. Burrows, D. Hoppe, TIRA: Configuring, executing, and disseminating information retrieval experiments, in: 2012 23rd International Workshop on Database and Expert Systems Applications, 2012, pp. 151-155. doi:10.1109/DEXA.2012.55.
[11] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.
[12] M. Pasca, Web-based open-domain information extraction, in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 2605-2606. URL: https://doi.org/10.1145/2063576.2064034. doi:10.1145/2063576.2064034.
[13] R. Rinott, L. Dankin, C. Alzate Perez, M. M. Khapra, E. Aharoni, N. Slonim, Show me your evidence - an automatic method for context dependent evidence detection, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 440-450. URL: https://aclanthology.org/D15-1050. doi:10.18653/v1/D15-1050.
[14] T. Bench-Capon, P. E. Dunne, Argumentation in artificial intelligence, Artificial Intelligence 171 (2007) 619-641. URL: https://www.sciencedirect.com/science/article/pii/S0004370207000793. doi:10.1016/j.artint.2007.05.001.
[15] I. Rahwan, F. Zablith, C. Reed, Laying the foundations for a World Wide Argument Web, Artificial Intelligence 171 (2007) 897-921. URL: https://www.sciencedirect.com/science/article/pii/S0004370207000768. doi:10.1016/j.artint.2007.04.015.
[16] M. Potthast, L. Gienapp, F. Euchner, N. Heilenkötter, N. Weidmann, H. Wachsmuth, B. Stein, M. Hagen, Argument search: Assessing argument relevance, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1117-1120. URL: https://doi.org/10.1145/3331184.3331327. doi:10.1145/3331184.3331327.
[17] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[18] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013. arXiv:1301.3781.
[19] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, 2013. arXiv:1310.4546.
[20] R. Krovetz, Viewing morphology as an inference process, in: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '93, Association for Computing Machinery, New York, NY, USA, 1993, pp. 191-202. URL: https://doi.org/10.1145/160688.160718. doi:10.1145/160688.160718.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Language, Speech, and Communication, MIT Press, Cambridge, MA, 1998.
[22] C. Zhai, J. Lafferty, A study of smoothing methods for language models applied to ad hoc information retrieval, SIGIR Forum 51 (2017) 268-276. URL: https://doi.org/10.1145/3130348.3130377. doi:10.1145/3130348.3130377.
[23] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument Retrieval, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma, C. Eickhoff, A. Névéol, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. 11th International Conference of the CLEF Association (CLEF 2020), volume 12260 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2020, pp. 384-395. URL: https://link.springer.com/chapter/10.1007/978-3-030-58219-7_26. doi:10.1007/978-3-030-58219-7_26.
[24] H. Bast, B. Buchhold, E. Haussmann, Semantic search on text and knowledge bases, Foundations and Trends in Information Retrieval 10 (2016) 119-271. URL: http://dx.doi.org/10.1561/1500000032. doi:10.1561/1500000032.
[25] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, 2019. arXiv:1908.10084.
[26] Q. V. Le, T. Mikolov, Distributed representations of sentences and documents, 2014. arXiv:1405.4053.
[27] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A unified embedding for face recognition and clustering, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015). URL: http://dx.doi.org/10.1109/CVPR.2015.7298682. doi:10.1109/cvpr.2015.7298682.
[28] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024-8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017. arXiv:1706.03762.
[30] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, M. Zhou, MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, 2020. URL: https://arxiv.org/abs/2002.10957. doi:10.48550/ARXIV.2002.10957.
[31] C. Hegde, S. Patil, Unsupervised paraphrase generation using pre-trained language models, 2020. arXiv:2006.05477.
[32] C. Akiki, M. Potthast, Exploring Argument Retrieval with Transformers, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes Papers of the CLEF 2020 Evaluation Labs, volume 2696, 2020. URL: http://ceur-ws.org/Vol-2696/.
[33] A. Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, first edition, fifth release, O'Reilly, Beijing, 2018. URL: https://katalog.ub.uni-leipzig.de/Record/0-1640048871.
[34] D. H. Ackley, G. E. Hinton, T. J. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Science 9 (1985) 147-169. URL: https://www.sciencedirect.com/science/article/pii/S0364021385800124. doi:10.1016/S0364-0213(85)80012-4.
[35] A. Holtzman, J. Buys, L. Du, M. Forbes, Y. Choi, The curious case of neural text degeneration, 2020. arXiv:1904.09751.
[36] A. Fan, M. Lewis, Y. Dauphin, Hierarchical neural story generation, 2018. arXiv:1805.04833.
[37] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems 20 (2002) 422-446. URL: https://doi.org/10.1145/582415.582418. doi:10.1145/582415.582418.
[38] webis.de, Touché 2021 relevance judgments, 2021. URL: https://webis.de/events/touche-22/shared-task-1.html, accessed: 2022-02-27.
[39] J. Močkus, On Bayesian Methods for Seeking the Extremum, Springer Berlin Heidelberg, Berlin, Heidelberg, 1975, pp. 400-404. URL: https://doi.org/10.1007/978-3-662-38527-2_55. doi:10.1007/978-3-662-38527-2_55.