=Paper=
{{Paper
|id=Vol-2936/paper-215
|storemode=property
|title=Argument Retrieval for Comparative Questions based on independent features
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-215.pdf
|volume=Vol-2936
|authors=Thi Kim Hanh Luu,Jan-Niklas Weder
|dblpUrl=https://dblp.org/rec/conf/clef/LuuW21
}}
==Argument Retrieval for Comparative Questions based on independent features==
Notebook for the Touché Lab on Argument Retrieval at CLEF 2021

Thi Kim Hanh Luu, Jan-Niklas Weder
Martin Luther University of Halle-Wittenberg, Universitätsplatz 10, 06108 Halle, Germany

Abstract

In this paper, we present our submission to the shared task on argument retrieval for comparative questions at CLEF 2021. For the given comparative topics, we retrieved relevant documents from the ClueWeb12 corpus using the BM25-based search engine ChatNoir and re-ranked them with our approach. Our approach combines multiple natural language processing techniques such as part-of-speech (POS) tagging, word embeddings, language models such as BERT, argument mining, and other machine learning methods. Using the TARGER tool, BERT, or PageRank we generated scores which are then used for the re-ranking. A support vector machine is used to learn weights for the final ranking of documents on the basis of those scores. The results on the topics of last year's shared task, evaluated with the nDCG@5 measure, showed that one configuration of our approach can improve the nDCG@5 by approximately 0.07 if we only consider the evaluated documents. Furthermore, one of our approaches reached fourth place in the Argument Retrieval for Comparative Questions task.

Keywords: information retrieval, comparative search engine, argument retrieval, natural language processing, machine learning

1. Introduction

Every person has to make decisions several times a day. Some of these decisions hardly have any influence on the person, while others can have a decisive impact on the person's life. An example of such an important decision might be at which university one should study or for which job one should apply. There is often a point where one has to weigh several options against each other and is forced to choose one of them, for example one of several universities. For this decision, one needs information that supports each side. Here it would be useful if there were a search engine that provided the pros and cons for a given comparative question. This is exactly the kind of problem that Touché Task 2 is concerned with. The goal is to rank documents from the ClueWeb12 corpus, retrieved via the search engine ChatNoir, so that documents that provide great value for a comparative question are ranked as high as possible [1]. In the following, we present our approach, with which we aim to solve this task and to provide people with the best possible arguments so that they can weigh the options for themselves and finally make an informed decision.

As an objective, we aim to improve over last year's baseline and ideally also to reach the most effective approach of last year. It should be mentioned that only one submission last year exceeded the baseline, with an improvement of the nDCG@5 of 0.012 [2].

The structure of the paper is as follows.
First, we review some related approaches that describe the basis of our software in more detail. This part is followed by our query expansion steps, which extend the original query to retrieve additional documents that also match the searched topic. After that, we deal with the scores used for improving the ranking. The scores are followed by an overview of our entire pipeline along with the options provided by the architecture we have chosen. The next topic is the results we obtained by applying our pipeline to last year's dataset. The paper is then closed by the conclusion, which reviews the most important parts of the pipeline and suggests some possible improvements.

2. Related Work

Our approach builds on several research areas, in particular argument mining and synonym extraction.

2.1. Argument Mining

Argument mining is the process of automatically extracting arguments from unstructured texts. This process is becoming increasingly important with regard to the automatic processing of texts from the web [3]. For a comparative question, a search engine can use argument mining to find relevant documents on the web that contain arguments regarding a concept. There are several developments for argument mining that apply natural language processing (NLP) methods. We chose TARGER because it can be used through an API and was designed explicitly with web documents in mind. TARGER is an open-source neural argument mining tool for tagging argument units, like premises and claims [4].

2.2. Synonym extraction

Synonym extraction is a research problem that is helpful for text mining and information retrieval [5]. For example, automatic query expansion using synonyms is capable of improving retrieval effectiveness [6]. The WordNet database can be used to obtain synonyms of an English word [7]. This method can suffer from the semantic gap problem [8]; consider, for example, "I go to the bank" and "I sit on the bank". To find the right synonyms for the word bank, we need to understand the context of these sentences. The use of such a dictionary is also sometimes expensive because the database has to be built and maintained manually [9]. An improved approach is to incorporate word embeddings to deal with this semantic problem [10]. There are several methods to calculate the similarity between words using word embeddings, such as the Manhattan distance [11]. In our approach, we use cosine similarity for synonym extraction.

3. Query expansion

User queries sometimes do not accurately reflect the user's information needs. For example, the queries are too short or the user uses semantically similar words that are not present in the retrieval system. To improve the retrieval process, the user's queries can be expanded through different query expansion methods [12]. In our approach, we generated several new queries from the given query using natural language processing, a part-of-speech method, word embeddings, and combinations of them. The following explains how the query expansion was implemented.

3.1. Expansion methods

In our first method, Preprocessing, a new query is generated by lemmatizing the individual words of the original query, for example, "Which four wheel truck is better: Ford or Toyota?" becomes "Which four wheel truck be good: Ford or Toyota?". In this method, we are not interested in a concrete comparison between the two objects Ford and Toyota, but we expect that this transformation to lemmas will return more results about these objects from the retrieval system.
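A minimal sketch of this lemma-based preprocessing could look as follows. It is an illustration rather than the exact implementation of our pipeline, and the model name en_core_web_sm is an assumption; any English spaCy model with a lemmatizer would do.

```python
# Sketch of the Preprocessing expansion step: replace every token of the
# query by its lemma. Assumes spaCy and an English model (here
# en_core_web_sm) are installed; details may differ from the real pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")

def lemmatize_query(query: str) -> str:
    """Return a new query in which every word is replaced by its lemma."""
    doc = nlp(query)
    return " ".join(token.lemma_ for token in doc)

print(lemmatize_query("Which four wheel truck is better: Ford or Toyota?"))
# e.g. "which four wheel truck be good : Ford or Toyota ?"
```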
Note that preprocessing belongs to the query expansion step but is implemented separately from the other expansion methods.

To focus on the comparative part of a query sent by the user, we assume in the second expansion method that the search engine should identify the terms of the query that stand in a comparative relationship; for example, four wheel truck, better, Ford, and Toyota are comparative terms. In the retrieval process, the search engine should find documents that contain these comparative terms. In this technique, the part-of-speech (POS) tagger from the spaCy language model [13] is used to select these terms and generate a new query from them (see Figure 1). We assume that the comparative terms of a query can have the following POS tags: numbers, verbs, adjectives, adverbs, nouns, and proper nouns. Numbers, adjectives, nouns, and proper nouns can represent the objects to be compared in the query. When comparing two actions, such as "should I buy or rent?", verbs and adverbs can play the role of the comparative terms. The query newly created by this POS method contains only these comparative terms; all non-comparative terms are removed.

Figure 1: Identification of part-of-speech tags, visualized with https://spacy.io/usage/visualizers.
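The POS-based selection of comparative terms can be sketched as follows. The tag set mirrors the categories named above; the concrete model name and the exact filtering are illustrative assumptions, not necessarily identical to our implementation.

```python
# Sketch of the POS-based expansion method: keep only tokens whose POS tag
# marks them as potential comparative terms and build a new query from them.
import spacy

nlp = spacy.load("en_core_web_sm")
COMPARATIVE_POS = {"NUM", "VERB", "ADJ", "ADV", "NOUN", "PROPN"}

def comparative_terms(query: str) -> list[str]:
    """Return the tokens of the query that may act as comparative terms."""
    return [t.text for t in nlp(query) if t.pos_ in COMPARATIVE_POS]

def pos_query(query: str) -> str:
    """Build the reduced query that consists only of comparative terms."""
    return " ".join(comparative_terms(query))

print(pos_query("Which four wheel truck is better: Ford or Toyota?"))
# e.g. "four wheel truck better Ford Toyota"
```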
To deal with the problem of semantically similar words, for example when the user uses notebook in the query while a factually relevant text contains only the word laptop, three further methods are implemented using WordNet [7], word embeddings [10], and word sense embeddings [14]. The WordNet method expands the original query by adding synonyms of the comparative terms that carry a NOUN tag to the end of the original query. The two embedding-based methods instead replace the comparative terms with the found synonyms to create a new query. The comparative terms used here have already been identified in the second method above.

In the first variant, we use the WordNet dictionary, a lexical database for the English language, to find the synonyms. The more specific a query is, the higher the probability that no document contains all words of the query, which can lead to ChatNoir returning no results. For this reason, only the first five synonyms returned by WordNet for each NOUN comparative term are used. For specific names such as Google, Yahoo, and Canon in a query, no synonyms can be found in the WordNet dictionary; however, similar terms such as search engine or camera, or other similar organizations such as Bing or Nikon, are sometimes helpful in the retrieval process. For example, with the query "which is better, Google or Yahoo?" the user does not want to compare the financial factors of the organizations but is interested in the functionality of the two search engines. The similar term search engine can help to describe the information need of the user more clearly in this example. Moreover, specific names of organizations may change over time, so that only a few documents with the old names are found. In this case, documents with the new or similar names may be relevant to the user's query. Therefore, our approach also searches for semantically similar words using word embeddings and uses them for the two following embedding-based query expansion methods. For each comparative term, the top two similar words are selected by the two highest cosine similarities. We use the word2vec embeddings from the spaCy library and the sense2vec model to generate word vectors for the comparative terms [13][15].

The word2vec model for the English language of spaCy was trained on the English dataset of OntoNotes 5.0, which contains various genres of text, like news and web data [16]. The sense2vec model was trained on the 2015 portion of the Reddit comments corpus [17]. Both pre-trained models are therefore suitable for our task with web documents. In the word2vec method, the top two similar words are determined for each comparative term using cosine similarity. If a candidate word is not present in the vocabulary of word2vec, no similar words are extracted. The sense2vec model is an extension of word2vec in which separate embeddings are learned for each sense of a word, based on its POS tag. For example, in the query "Which is better, Canon or Nikon?", Canon is recognized as a proper noun by the POS tagger or as a PRODUCT by named entity recognition. In this case, similar words should be searched with the entity PRODUCT taken into account. To make this possible, we use the sense2vec model to find similar words, again by cosine similarity.

After the top two similar words have been found for each comparative term using word2vec and sense2vec, we replace these terms with their associated similar words. This replacement process can generate several new queries; for example, "which is better, laptop or desktop?" is extended to "which is better, notebook or desktop?" and "which is better, laptop or PC?", because notebook is a similar word of laptop and PC is a similar word of desktop. Suppose we have n comparative terms in our query. If two similar words are found for each term, there are 2n new queries created by word2vec and another 2n by sense2vec. If all of the above expansion methods are used, we obtain a maximum of 3 + 2 * (2n) new queries from the given original topic. Since we need a final ranking without duplicates in the end, all documents retrieved in this step are merged in the next step.
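The embedding-based replacement can be sketched as follows. In the pipeline, the full vector tables of spaCy and the sense2vec package are queried; in this self-contained illustration, a small hand-picked candidate list stands in for the vocabulary, and the helper names and candidates are our own illustrative choices.

```python
# Sketch of the embedding-based expansion: for a comparative term, pick the
# two most similar candidate words by cosine similarity and substitute them
# into the query to create new queries.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # any spaCy model that ships with word vectors
CANDIDATES = ["notebook", "computer", "PC", "tablet", "smartphone", "monitor"]

def top_similar(term: str, n: int = 2) -> list[str]:
    """Return the n candidate words most similar to `term` by cosine similarity."""
    v = nlp(term).vector
    if not v.any():                       # term not in the vector table: no expansion
        return []
    scored = []
    for cand in CANDIDATES:
        w = nlp(cand).vector
        if not w.any():
            continue
        sim = float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))
        scored.append((sim, cand))
    return [cand for _, cand in sorted(scored, reverse=True)[:n]]

def expanded_queries(query: str, term: str) -> list[str]:
    """One new query per similar word, created by replacing `term` in the query."""
    return [query.replace(term, similar) for similar in top_similar(term)]

print(expanded_queries("which is better, laptop or desktop?", "laptop"))
```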
3.2. Merging

After query expansion, the original query and all expanded queries are sent separately to the search engine ChatNoir [1]. We keep all punctuation throughout the queries. For each query, 100 documents are retrieved by default. This means that for each original topic, a maximum of 100 * (3 + 4n) documents are requested. Since the submitted queries are similar, the set of retrieved documents usually contains multiple duplicates, but with different scores. This motivates us to filter the search results so that duplicate documents are not shown to the user. In our approach, the identical documents returned for the different queries have different relevance scores and, at the same time, different importance with respect to the user's expectation. We assume that the documents retrieved for the original query are more likely to be relevant to the user than the documents from the expanded queries, i.e. the documents retrieved by the original query meet the user's information need better. Accordingly, each query expansion method is assigned a different importance weight. In our default approach, we use the weights w_original = 2 and w_pos = 1.5; all other expanded queries are assigned the same importance weight w_wordnet = w_word2vec = w_sense2vec = 1. Let T = {original, pos, wordnet, word2vec, sense2vec} be the set of tags of all queries. If a document d occurs with different relevance scores r_dt for different expansion methods t ∈ T, these relevance scores are recalculated by multiplication with the predefined importance weights.

Using these weighted scores, a merge process is performed. We tested two merge functions. In the first variant, the maximum weighted score is selected from all weighted scores and assigned to the document: r_d = max_t(r_dt * w_t). We assume here that the document is relevant to the user query if it is highly ranked in at least one expansion method. In the second variant, the average of all weighted scores is assigned to the document: r_d = average_t(r_dt * w_t). In this case, the document is relevant if it has a high relevance score in all query expansion methods.

4. Scores

4.1. Argumentative scores

The motivation for our argumentative score is that an answer suitable for a comparative question should be supported by multiple statements and premises. To implement it, the argument mining system TARGER with the model classifyWD_dep was used, because the documents of the ClueWeb12 dataset were collected from English websites and the model classifyWD_dep was trained on the similar web discourse dataset [4][18]. The TARGER system identifies argumentative units, premises and claims, on the token level of the input document and returns for each token a label and the related probability. An argumentative score is computed for the document as the average of the valid returned probabilities. In our approach, a returned probability is valid if its token is at the beginning or inside of a premise or claim in the document; the outside labels from TARGER are not relevant. Let L be the set of all labels of valid argumentative units for premises and claims. For each word w of the document d, TARGER outputs a tuple (l_w, p_w) of the most likely predicted argumentative unit label and the corresponding prediction probability. We use (l_w, p_w) only if l_w ∈ L is a valid argumentative unit, and d_L consists only of the words w with l_w ∈ L. The argumentative score for the document d is then computed as follows:

\[ p = \frac{1}{|d_{\mathcal{L}}|} \sum_{w \in d_{\mathcal{L}}} p_w \]

\[ \mathrm{arg\_score}(d) = \begin{cases} p & \text{if } p \ge \theta \\ 0 & \text{otherwise} \end{cases} \]

In our approach, the threshold θ defines the minimum argumentativeness a document needs to have. A document is considered argumentative if its argumentative score is at least 0.55. If the threshold is set higher than 0.55, we hardly get any argumentative documents. Our experiments with the topics of last year's shared task have also shown that the threshold value 0.55 provided the best result.
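The computation of the argumentative score can be sketched as follows. The token list with (label, probability) pairs is assumed to come from the TARGER API; the concrete label strings below are illustrative placeholders for "beginning/inside of premise or claim", not necessarily TARGER's exact label names.

```python
# Sketch of the argumentative score: average the probabilities of all tokens
# labeled as beginning or inside of a premise or claim, and zero the score
# if it stays below the threshold theta.
VALID_LABELS = {"P-B", "P-I", "C-B", "C-I"}   # assumed BIO-style premise/claim tags
THETA = 0.55                                   # minimum argumentativeness

def argumentative_score(tagged_tokens: list[tuple[str, float]],
                        theta: float = THETA) -> float:
    """Compute arg_score(d) from the (label, probability) pairs of one document."""
    probs = [p for label, p in tagged_tokens if label in VALID_LABELS]
    if not probs:
        return 0.0
    p = sum(probs) / len(probs)        # average probability over d_L
    return p if p >= theta else 0.0    # apply the threshold theta

# Example with made-up tagger output for a short document:
doc = [("O", 0.9), ("C-B", 0.7), ("C-I", 0.8), ("P-B", 0.6), ("P-I", 0.65)]
print(argumentative_score(doc))  # (0.7 + 0.8 + 0.6 + 0.65) / 4 = 0.6875 >= 0.55
```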
4.2. Trustworthiness

Another important characteristic of a good argument is its truthfulness or, if this information cannot be derived directly, at least the reliability of the source. We concluded that another score that encodes exactly this information may be beneficial for our problem. Unfortunately, determining the trustworthiness of a website is not a simple task and may not even be clearly definable. Utilizing PageRank, we try to extend our pipeline with a trust component. This should allow us to weigh the different answers against each other and thus create another score on which we can rely [19]. In rough summary, PageRank is a way to define the relevance of a website without analytically processing its content. PageRank accomplishes this by taking advantage of the link structure of the web. The assumption is that links to reliable websites are more frequent than links to unreliable ones and that reliable websites in turn mainly link to reliable websites. Thus, the link structure of the web is interpreted as a self-assessment of trustworthiness. PageRank then condenses this structure into a single value per website, so this score can be seen as some kind of peer review [20].

Here we use PageRank to obtain a pseudo trustworthiness. For this, we use a reimplementation of PageRank, OpenPageRank [21]. In the following, we briefly look at the correlation between PageRank and the ratings we know from last year's evaluation to make a statement about the information content of this score alone [2]. Figure 2 shows that there is no clear relation between PageRank and the ratings of the documents. This is supported by a small Spearman correlation of about -0.003. Since one would expect a strictly monotonic relationship if PageRank could make a statement about the Qrel score, these values speak against this assumption. Therefore, it can be assumed that the PageRank score will not provide much additional value.

Figure 2: The PageRank obtained depending on the relevance score of the documents. Zero means not relevant, one denotes relevant, and two indicates very relevant. The PageRank values range from 0 to 10, with 0 being the lowest and 10 being the best. Each dot represents a document.

4.3. Relevance Prediction

Due to the availability of documents assessed for their relevance to a specific topic from last year, we decided to try a supervised learning approach, especially since a result from last year that utilized BERT to embed documents and topics and calculated a similarity score based on those embeddings worked comparably well [2]. We decided to build on this idea and use BERT to predict whether a particular document is relevant or not. BERT stands for Bidirectional Encoder Representations from Transformers and is an open-source machine learning framework for natural language processing [22]. The problem defined here was interpreted as a two-sentence classification problem, as it is called in the SimpleTransformers library [23]. This means that we use two sentences as input and receive a class as the result. The two input sentences consist on the one hand of the query and on the other hand of the complete document for which we want to determine the relevance. It should be mentioned that in the implementation used here, the input for BERT is structured such that the query and the document are separated by the separator token: the input consists of the classifier token, followed by our query, the first separator token, the corresponding document, and finally the second separator token. Furthermore, we predict the relevance labels with the help of BERT, i.e. we obtain 0, 1, or 2, where these values represent the classes irrelevant, relevant, and very relevant. Our goal was to train BERT so that we obtain the appropriate relevance scores as a result.

As base model, we use bert-base-cased. SimpleTransformers uses Hugging Face in the background to provide the corresponding pre-trained models [24]. It would therefore be straightforward to use other models instead of BERT or bert-base-cased. The decision to use specifically this model is mainly due to the hardware we had available. The parameters for the fine-tuning process were determined based on the cross-entropy, and we used the parameters that provided the best cross-entropy. The most important parameters are the learning rate of 1e-8, the learning rate applied only to the classification layer of 1e-5, the number of epochs, which we set to 10, and the batch size of 4. From the two different learning rates, one can already see that we make almost no changes to the main model of BERT. This turned out to produce a better cross-entropy than training the whole model with a higher learning rate. We decided not to use the option of stopping the training earlier, since this led to a deterioration of the cross-entropy in our experiments. Because the classes differ significantly in the number of associated examples, we weighted each class with 1 − w_c, where w_c is the relative frequency of class c. For the training, we used the evaluated documents of last year's task [2].
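A minimal sketch of this two-sentence classification set-up with SimpleTransformers is shown below. The tiny training frame and the class weights are illustrative, and the separate, higher learning rate for the classification layer is omitted here for brevity; the real training uses last year's judged documents.

```python
# Sketch of the relevance prediction: a sentence-pair classification model
# that maps (query, document) pairs to the labels 0/1/2.
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs

train_df = pd.DataFrame({
    "text_a": ["which is better, laptop or desktop?",      # the query
               "which is better, Google or Yahoo?"],
    "text_b": ["Laptops are portable, desktops are ...",    # the document text
               "An unrelated page about gardening ..."],
    "labels": [2, 0],                                       # 0/1/2 = ir-/relevant/very relevant
})

args = ClassificationArgs(
    num_train_epochs=10,        # epochs as reported above
    train_batch_size=4,         # batch size as reported above
    learning_rate=1e-8,         # base learning rate for the BERT body
    overwrite_output_dir=True,
)

model = ClassificationModel(
    "bert", "bert-base-cased",
    num_labels=3,
    weight=[0.7, 0.2, 0.1],     # illustrative class weights (1 - relative frequency)
    args=args,
    use_cuda=False,
)

model.train_model(train_df)
predictions, raw_outputs = model.predict(
    [["which is better, laptop or desktop?", "Desktops usually offer more ..."]]
)
print(predictions)  # predicted class per (query, document) pair
```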
4.4. Combining the individual scores

To combine the scores, we use a support vector machine (SVM). This allows us to determine a weighting for the individual scores based on the documents already evaluated last year [2]. As input for the SVM, we use the previously calculated scores and the score provided by ChatNoir with the request. If necessary, individual scores can be included in or left out of the calculation of the final score. However, this does not apply to the score provided by ChatNoir, which is always used. Since this is not a classification problem but the task of computing a score, we use support vector regression. This means we get real numbers from our SVM and not labels, as was the case, for example, with BERT in Section 4.3. For the implementation, we utilize scikit-learn and keep the default settings [25]. Importantly, we normalize the scores used as inputs with the z-score standardization. Furthermore, these normalized values are then mapped by a sigmoid function onto the value range between zero and one. The standardization used in the implementation is defined in equations (1) and (2):

\[ \mathrm{normalized\_score} = \mathrm{sigmoid}\left(\frac{\mathit{input} - \mu_i}{sd_i}\right) \tag{1} \]

\[ \mathrm{sigmoid}(x) := \begin{cases} \frac{1}{1+e^{-x}} & \text{if } x \ge 0 \\ \frac{e^{x}}{1+e^{x}} & \text{else} \end{cases} \tag{2} \]

where input is a value belonging to one of the scores that is to be normalized, μ_i is the previously calculated mean for that score, and sd_i is the previously calculated standard deviation for that score.

This is firstly to take advantage of the generally expected improved performance of an SVM on standardized data [26]. Second, the sigmoid function in particular is intended to catch outliers that could potentially overshadow the other scores, thus ensuring that all scores can exert their attributed influence on the final result. The result we receive from the SVM is used as our final score. If individual values are not available, they are set to the mean and then processed as usual.
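The score combination can be sketched as follows: each input score is z-score standardized, mapped through the branch-wise sigmoid of equation (2), and fed into a support vector regressor with scikit-learn's default settings. The column layout and the toy numbers are illustrative, not our real training data.

```python
# Sketch of the score combination with support vector regression.
import numpy as np
from sklearn.svm import SVR

def stable_sigmoid(x: np.ndarray) -> np.ndarray:
    """Sigmoid as in equation (2): two branches for numerical stability."""
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    out[~pos] = np.exp(x[~pos]) / (1.0 + np.exp(x[~pos]))
    return out

def normalize(scores: np.ndarray, mean: np.ndarray, sd: np.ndarray) -> np.ndarray:
    """Equation (1): z-score standardization followed by the sigmoid."""
    return stable_sigmoid((scores - mean) / sd)

# Toy data: rows are documents, columns are the individual scores
# (ChatNoir, argumentative, trustworthiness, BERT); targets are relevance labels.
X = np.array([[512.3, 0.61, 4.0, 1.2],
              [287.9, 0.00, 6.5, 0.1],
              [890.1, 0.72, 2.0, 1.9]])
y = np.array([1.0, 0.0, 2.0])

mean, sd = X.mean(axis=0), X.std(axis=0)
X_norm = normalize(X, mean, sd)

svr = SVR()                       # default settings, as in our pipeline
svr.fit(X_norm, y)
final_scores = svr.predict(X_norm)
print(final_scores)               # real-valued scores used for the final ranking
```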
Figure 3: Schematic illustration of the complete pipeline (topics with preprocessing and query expansion, ChatNoir retrieval, merging of duplicates, the argumentative, trustworthiness, and BERT scores, the SVM, and the final results). The dashed lines show possible alternative routes or paths that can be disabled. Therefore, the individual scores given to the SVM can be activated or deactivated. The only score that is always used is the one from ChatNoir, so the SVM can utilize at least one and at most four scores.

5. Complete layout of our approach

The pipeline presented here consists of parts that have already been discussed in Sections 3 and 4. Those parts can be activated individually. On the one hand, this allows us to examine the different segments separately in the context of the complete pipeline; on the other hand, it makes it possible to add further scores with minimal effort, because the individual parts function independently of each other. Figure 3 provides a rough overview of how our pipeline is structured. We start with the topics, which are supplied as a list of queries. These are manipulated accordingly in the preprocessing and query expansion steps. With ChatNoir we retrieve potential documents for the queries; at the same time, ChatNoir provides us with a first score. This score is the only non-optional one in the pipeline and is therefore always used. ChatNoir is followed by the merging step already described. After this, the selected additional scores are calculated, and the intermediate results are mapped to a final score using the support vector machine. The result is then stored in the standard TREC run file format.

Important to note here is how we take care of ties. If multiple documents are tied, all documents except the first one have ε_100 subtracted from their final score, where ε_100 is defined as 100 * ε and ε is the smallest possible float. We cannot use ε directly, because we would introduce rounding errors that would cause the ties to persist.
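One way to realize the described tie handling is sketched below, applied cumulatively so that later tied documents also become distinct. "Smallest possible float" is interpreted here as the machine epsilon of a 64-bit float; this interpretation and the helper names are assumptions for illustration.

```python
# Sketch of the tie handling: among documents with identical final scores,
# every document after the first is pushed 100 * epsilon below its
# predecessor so that the stored ranking has strictly decreasing scores.
import sys

EPS_100 = 100 * sys.float_info.epsilon

def break_ties(ranking: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """ranking: (doc_id, score) pairs already sorted by descending score."""
    result: list[tuple[str, float]] = []
    for doc_id, score in ranking:
        if result and score >= result[-1][1]:
            # tied with the previous document: place it just below
            score = result[-1][1] - EPS_100
        result.append((doc_id, score))
    return result

print(break_ties([("d1", 0.8), ("d2", 0.8), ("d3", 0.8), ("d4", 0.5)]))
# d2 and d3 end up 100*eps and 200*eps below d1's score, d4 keeps its score
```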
6. Results

We use the nDCG@5 to evaluate our pipeline in its entirety. The nDCG is calculated with the help of trec_eval using the measurement ndcg_cut [27]. In the following, the normal nDCG refers to the one that considers all documents, and the judged one refers to the version that only considers evaluated documents; the latter corresponds to the '-J' flag of trec_eval. The decision to consider the normal nDCG as well as the judged one was made since it can happen that we rightly rank documents higher than the approaches of last year and thus get a worse nDCG, although the order is better. The judged nDCG provides an impression of the extent to which the internal order among the evaluated documents has improved or worsened. Ideally, we could do without this approximation, but potential improvements might otherwise go unnoticed.

Table 1: The nDCG@5 values for some selected configurations of our pipeline. The values are first shown for each component alone; these rows are separated by lines from the next category, in which the number of active components is increased by one. The selection of the combinations shown here was made on the basis of observations made during development. The 'nDCG@5' column is the standard variant of the nDCG@5; the 'nDCG@5 (J)' column only takes evaluated documents into account. The best nDCG value is marked in bold in both cases. The combinations submitted for evaluation are marked in the last column.

{| class="wikitable"
! Prepro. !! Queryex. !! Argumen. !! Trust. !! BERT !! nDCG@5 !! nDCG@5 (J) !! Submitted
|-
|  ||  ||  ||  ||  || '''0.5187''' || 0.5801 ||
|-
| × ||  ||  ||  ||  || 0.5084 || 0.5842 ||
|-
|  || × ||  ||  ||  || 0.4909 || 0.5870 ||
|-
|  ||  || × ||  ||  || 0.4655 || 0.6270 ||
|-
|  ||  ||  || × ||  || 0.4996 || 0.5924 ||
|-
|  ||  ||  ||  || × || 0.5122 || 0.5791 ||
|-
| × || × ||  ||  ||  || 0.4879 || 0.5870 || ×
|-
|  ||  || × ||  || × || 0.4595 || '''0.6516''' || ×
|-
| × || × || × ||  ||  || 0.4169 || 0.6221 || ×
|-
| × || × ||  ||  || × || 0.5040 || 0.5819 || ×
|-
| × || × || × ||  || × || 0.4164 || 0.6462 || ×
|}

In the first part, we look at adding individual scores, which results in us utilizing a total of two scores. As a baseline, we use the two nDCG@5 values that ChatNoir achieves. These values can be found in the first row of Table 1, and we use them to determine the extent to which the ranking has improved or worsened. The first thing to note is that we always have a deterioration in the normal nDCG@5 compared to the baseline. This extends to all combinations of scores that we have examined in more detail and can be seen in Table 1. On the other hand, especially for the judged nDCG@5 when adding only one score, we observe an improvement for all combinations except BERT. This noticeable difference will be examined further in one of the following passages.

Here, the small change due to preprocessing or query expansion is to be expected, since by weighting the different origins of queries, the original ones are weighted with a factor of two and all others with a factor of 1.5 or 1. Due to this weighting, it is relatively unlikely that further documents slip upwards, and the order in comparison to the baseline should remain mostly the same. Only strong outliers are likely to change the order, and in this case ChatNoir itself would have classified them as a good fit. The argumentative score worsens the normal nDCG@5 but provides a significant improvement in the judged one; it is the largest judged nDCG@5 among the single-score configurations. Based on the observations already made during development, this is not a big surprise. The explicit value used for θ was chosen so that we get an ideal nDCG@5; the nDCG@5 shown in Table 1 could thus vary quite a bit for unknown topics and might require a corresponding adjustment of θ. The fact that the trustworthiness score has little or no influence was to be expected due to the very low correlation already described. Nevertheless, the improvement in the judged nDCG@5 should be emphasized here.

BERT falls somewhat out of line here. On the one hand, adding the score computed with BERT gives the largest normal nDCG@5 among the combinations that add only one score, yet this value is still worse than the baseline. On the other hand, BERT has the worst judged nDCG@5 within the same group. A possible explanation is that, compared to the other approaches, BERT might assign higher scores to documents that have already been evaluated and assigns comparatively few non-evaluated documents higher scores. This is especially plausible if one takes into account that BERT's training was based on the evaluated documents of last year and that BERT should therefore know the correct solutions. Taking this into account, the deterioration in both nDCG@5 values compared to the baseline is unexpected.

We now proceed with the further combinations based on the order of Table 1. The first combination is preprocessing together with query expansion. Again, these two steps together produce a relatively small change due to their respective weights. Thus, the normal nDCG@5 lies between the two previous ones and the judged nDCG@5 is identical to the better one. The combination of the argumentative score and BERT is interesting at this point, as together they outperform the already well-performing judged nDCG@5 of activating only the argumentative score. In particular, this is noteworthy because, based on the individually activated components, BERT does not appear to provide significant additional value. However, the improvement of the judged nDCG@5 shows that the information BERT can contribute benefits our pipeline. It is also important to note that this combination provides the best judged nDCG@5 that we have observed in our experiments.
Here it can be concluded that the information encoded by the argumentative score and by BERT differ from each other. Furthermore, they complement each other in a way that benefits the ranking.

In the following, only combinations are considered in which the preprocessing component and the query expansion are activated. We start with the argumentative score. As expected from the observations already described, the argumentative score again performs well here and provides us with a judged nDCG@5 of over 0.6. However, this has worsened minimally, by about 0.005, compared to before; the normal nDCG@5 has worsened by ≈ 0.05. The combination of preprocessing, query expansion, and BERT shows a degradation in normal nDCG@5 of ≈ 0.008 and an improvement in judged nDCG@5 of ≈ 0.002. A more significant change can be observed in the version that uses preprocessing, query expansion, the argumentative score, and BERT. It shows a degradation of ≈ 0.07 or ≈ 0.04 in the normal nDCG@5 compared to the two-component combinations of preprocessing with query expansion and of the argumentative score with BERT, respectively. Interestingly, this combination also yields a worse judged nDCG@5. Comparing this combination with the versions that use preprocessing, query expansion, and either the argumentative score or BERT shows, as before, that BERT and the argumentative score benefit from each other.

Based on the observations described, we decided to submit the combinations marked in Table 1. The submission was done using TIRA, a software solution that sandboxes the submitted software in a virtual machine and keeps a copy of it. Among other things, this makes it easier to use the submitted software again in the future [28].

7. Conclusion

As a first conclusion, our most promising approach is the combination of the argumentative score with BERT. This combination is followed very closely by these two scores together with preprocessing and query expansion. In particular, the argumentative score always prevails as the most important score in our approach and plays a role in all substantial improvements. Furthermore, based on our observations, an order of utility can be established for the individual parts of our software. In this order, the argumentative score is undoubtedly the most important. It is followed by BERT, even if BERT only provides an improvement in conjunction with the argumentative score. After BERT follows the trustworthiness score, and last come preprocessing and query expansion.

Another finding is that the pipeline presented here achieves an improvement in the ranking among the evaluated documents. However, this result should be taken with a grain of salt, since some parts, in particular BERT and the SVM, are designed to perform well on the problem studied here. The problem that arises from this is that the learned parameters may not be applicable to new problems. Nevertheless, we can state that our pipeline delivers several combinations that, based on our observations, have the potential to outperform the baseline. For the future, our approach can serve as a foundation and can be extended relatively easily with new scores. This is particularly evident in the fact that many parts of the pipeline act independently of each other and the exact number of scores is not fixed.
If our formalities are complied with, the internal data structure only has to be extended to include the new score, and all other components will take this additional information into account and include it in their calculations.

A. Code and data availability

Our complete code is available at https://github.com/JanNiklasWeder/Touche-21-Task-2. All necessary data will be downloaded automatically if it is not already locally available in the respective folder.

References

[1] M. Potthast, M. Hagen, B. Stein, J. Graßegger, M. Michel, M. Tippmann, C. Welsch, ChatNoir: A Search Engine for the ClueWeb09 Corpus, in: B. Hersh, J. Callan, Y. Maarek, M. Sanderson (Eds.), 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2012), ACM, 2012, p. 1004. doi:10.1145/2348283.2348429.
[2] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument Retrieval, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, number 2696 in CEUR Workshop Proceedings, Aachen, Germany, 2020. URL: http://ceur-ws.org/Vol-2696/paper_261.pdf.
[3] M. Lippi, P. Torroni, Argumentation mining: State of the art and emerging trends, ACM Trans. Internet Technol. 16 (2016). URL: https://doi.org/10.1145/2850417. doi:10.1145/2850417.
[4] A. Chernodub, O. Oliynyk, P. Heidenreich, A. Bondarenko, M. Hagen, C. Biemann, A. Panchenko, TARGER: Neural Argument Mining at Your Fingertips, in: M. Costa-jussà, E. Alfonseca (Eds.), 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Association for Computational Linguistics, 2019, pp. 195–200. URL: https://www.aclweb.org/anthology/P19-3031.
[5] L. Zhang, J. Li, C. Wang, Automatic synonym extraction using word2vec and spectral clustering, in: 2017 36th Chinese Control Conference (CCC), 2017, pp. 5629–5632. doi:10.23919/ChiCC.2017.8028251.
[6] N. Kanhabua, K. Nørvåg, QUEST: Query expansion using synonyms over time, in: J. L. Balcázar, F. Bonchi, A. Gionis, M. Sebag (Eds.), Machine Learning and Knowledge Discovery in Databases, Springer Berlin Heidelberg, Berlin, Heidelberg, 2010, pp. 595–598.
[7] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Language, Speech, and Communication, MIT Press, Cambridge, MA, 1998.
[8] F. Yan, Q. Fan, M. Lu, Improving semantic similarity retrieval with word embeddings, Concurrency and Computation: Practice and Experience 30 (2017). doi:10.1002/cpe.4489.
[9] N. Mohammed, Extracting word synonyms from text using neural approaches, The International Arab Journal of Information Technology (2019) 45–51. doi:10.34028/iajit/17/1/6.
[10] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, 2013. arXiv:1310.4546.
[11] V. M K, K. K, A survey on similarity measures in text mining, Machine Learning and Applications: An International Journal 3 (2016) 19–28. doi:10.5121/mlaij.2016.3103.
[12] C. Carpineto, G. Romano, A survey of automatic query expansion in information retrieval, ACM Comput. Surv. 44 (2012). URL: https://doi.org/10.1145/2071389.2071390. doi:10.1145/2071389.2071390.
[13] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength Natural Language Processing in Python, 2020. URL: https://doi.org/10.5281/zenodo.1212303. doi:10.5281/zenodo.1212303.
[14] A. Trask, P. Michalak, J. Liu, sense2vec - a fast and accurate method for word sense disambiguation in neural word embeddings, 2015. arXiv:1511.06388.
[15] explosion, sense2vec: Contextually-keyed word vectors, https://github.com/explosion/sense2vec, 2018.
[16] R. Weischedel, M. Palmer, M. Marcus, E. Hovy, S. Pradhan, L. Ramshaw, N. Xue, A. Taylor, J. Kaufman, M. Franchini, M. El-Bachouti, R. Belvin, A. Houston, OntoNotes Release 5.0, https://catalog.ldc.upenn.edu/LDC2013T19, 2013.
[17] Reddit comments corpus, https://files.pushshift.io/reddit/comments/, 2016.
[18] I. Habernal, I. Gurevych, Argumentation mining in user-generated web discourse, Computational Linguistics 43 (2017) 125–179. URL: https://www.aclweb.org/anthology/J17-1004.
[19] I. Zaihrayeu, P. P. da Silva, D. L. McGuinness, IWTrust: Improving user trust in answers from the web, in: Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2005, pp. 384–392. doi:10.1007/11429760_27.
[20] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Technical Report 1999-66, 1999. URL: http://ilpubs.stanford.edu:8090/422/.
[21] A. T. P. Ltd, Open PageRank, https://www.domcop.com/openpagerank/, accessed 2021.
[22] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[23] T. C. Rajapakse, Simple Transformers, https://github.com/ThilinaRajapakse/simpletransformers, 2019.
[24] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, É. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830. URL: http://jmlr.org/papers/v12/pedregosa11a.html.
[26] S. Ali, K. A. Smith-Miles, Improved support vector machine generalization using normalized input space, in: A. Sattar, B.-h. Kang (Eds.), AI 2006: Advances in Artificial Intelligence, Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 362–371.
[27] C. Macdonald, I. Soboroff, B. Gamari, trec_eval version 9.0.8, GitHub, 2020. URL: https://github.com/usnistgov/trec_eval.
[28] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.