=Paper=
{{Paper
|id=Vol-2936/paper-208
|storemode=property
|title=Development of an IR System for Argument Search
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-208.pdf
|volume=Vol-2936
|authors=Marco Alecci,Tommaso Baldo,Luca Martinelli,Elia Ziroldo
|dblpUrl=https://dblp.org/rec/conf/clef/AlecciBMZ21
}}
==Development of an IR System for Argument Search==
Development of an IR System for Argument Search
Notebook for the Touché Lab on Argument Retrieval at CLEF 2021

Marco Alecci, Tommaso Baldo, Luca Martinelli and Elia Ziroldo
University of Padua, Italy

Abstract
Search engines are the easiest way to find the information we need in our daily life, and they have become more and more powerful in recent years. However, they are still far from perfect, and some problems affect even the most advanced search engines. In this paper we discuss our approach to the problem of argument retrieval, documenting our participation in the CLEF 2021 Touché Task 1. In particular, we present our IR system for the args.me corpus, a collection of documents extracted from web debate portals. After a pre-processing phase of the documents, we tried different methods such as query expansion and re-ranking based on sentiment analysis. In the final part we report the results of our experiments and discuss them, together with other possible strategies that could be applied in the future.

Keywords: Information Retrieval, Search Engine, Argument Retrieval

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
marco.alecci@studenti.unipd.it (M. Alecci); tommaso.baldo@studenti.unipd.it (T. Baldo); luca.martinelli.1@studenti.unipd.it (L. Martinelli); elia.ziroldo@studenti.unipd.it (E. Ziroldo)
https://lucamartinelli.hopto.org/ (L. Martinelli)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

In the last decade our everyday life has become more and more closely connected to the web, and the use of search engines is one of the most common tasks in our daily routine. They are the easiest and most reliable way to get information about anything we need, but unfortunately they are still far from perfect. One of the problems that affect search engines concerns the retrieval of arguments, which, according to previous work, can be defined as a single conclusion supported by one or more premises [1]. To contribute to the solution of this problem, we decided to participate in the Touché 2021 Lab [2] on argument retrieval (https://webis.de/events/touche-21/), proposed by CLEF (http://www.clef-initiative.eu/), because we believe that argument retrieval is a crucial feature, especially nowadays, when web sources such as social media communities and blogs are growing faster and faster. Of the two tasks proposed by the Touché Lab, we took part in Task 1, which concerns argument retrieval from debates on controversial topics. The dataset is the one used by the argument search engine args.me [3], and we chose to use the downloadable corpus (https://zenodo.org/record/3734893).

The paper is organized as follows. In Sec. 2 we describe some related work concerning argument retrieval. In Sec. 3 we present our approach to the task: after pre-processing the documents, we implemented three different strategies, namely different weights for the fields of a document, query expansion with synonyms from WordNet (https://wordnet.princeton.edu/), and re-ranking with sentiment analysis. To select the parameters and weights used by our methods, we relied on the scores obtained by our system on the topics from the Touché 2020 Lab. In Sec. 4 we describe our experimental setup, while Sec. 5 is devoted to result analysis. Finally, Sec. 6 contains our final considerations and a discussion of possible future work.
2. Related Work

Several previous studies have addressed the problem of argument search, but our starting point was the overview of the previous edition of the Touché Lab [4]. The common approach followed by the participating teams consisted of three main parts: (1) a retrieval strategy; (2) an augmentation component, such as query expansion; (3) a re-ranking component which modifies the score of the initially retrieved documents. The two most used models were BM25 and LMDirichlet, while a few other teams used DPH or TF-IDF. The argument search engine args.me [5], from which the corpus was extracted, is based on the BM25 retrieval model. However, previous studies compared different retrieval models and showed that LMDirichlet and DPH are better suited for argument retrieval [6].

2.1. Pre-processing

A fundamental step for argument retrieval is the pre-processing of the documents. One possible approach is the one followed by Staudte et al. [7], which concerns primarily the pre-processing of the words rather than of the whole documents. They started with basic operations such as removing punctuation, URLs and square brackets, but they also introduced more specific rules, such as replacing a repetition (>2) of the same letter with a single one. Indeed, in blogs and social media users frequently write in colloquial language, repeating the same letter more than once. They also deleted arguments shorter than 26 words, since users often write short arguments to express agreement or disagreement with a previous argument rather than to express their own reasons.

2.2. Query expansion

A query represents the information need of a user, but queries are usually too short to find the most relevant documents. For example, due to a vocabulary mismatch the Information Retrieval (IR) system can discard a document that was in fact relevant. To avoid this problem, queries are often expanded with more terms to reduce the gap between the query and the concepts the user wanted to express. The approach followed by Akiki et al. [8] makes use of the GPT-2 model [9] to add argumentative text to the original query; a new set of queries is then built from the generated sentences. Another possible solution is to use lexical properties to add new terms to the original query. Bundesmann et al. [10] implement this strategy by adding synonyms taken from the WordNet database.

2.3. Re-ranking

After retrieving the most relevant documents, an IR system can re-rank the candidates to take into account additional criteria involving different features of the documents. Shahshahani et al. [11] describe how their final ranking is produced using the learning-to-rank library RankLib (https://sourceforge.net/p/lemur/wiki/RankLib) to incorporate argument quality and Named Entity Recognition; their assumption is that recognized entities make the premises more persuasive and effective. Another approach, presented by Dumani et al. [12], is to group premises that support the same conclusion. It is then possible to calculate a score that indicates how convincing a premise is compared to the other premises of the same claim. The solution proposed by Bundesmann et al. [10] uses a machine learning approach to process the initial documents and assign them a score indicating their argumentative quality.
Following Wachsmuth et al. [13], they annotated a score for each of three aspects: logical quality, rhetorical quality and dialectical quality. Another possible strategy is the use of sentiment analysis to determine the sentiment of a document, and thus how emotionally involved its author is. Indeed, to deal with argument retrieval it is crucial to be able to understand the emotions and the writer's frame of mind. Since several studies [3] underline that an emotional argument is more powerful than a neutral or impassive one, Staudte et al. [7] decided to promote emotional documents by combining their DPH score with the one calculated with sentiment analysis. By contrast, another team from the previous edition of Touché decided to assign a higher score to neutral arguments, assuming that a neutral sentiment coincides with higher relevance of a document.

3. Methodology

As a starting point, we pre-processed all the documents contained in the args.me corpus, removing stop words and applying different filters. To create the index and to perform the search we relied on Apache Lucene (https://lucene.apache.org/). Since BM25 and LMDirichlet were the most used models in the previous edition of the Touché Lab, and since the args.me search engine also relies on BM25, we decided to use Lucene's implementation of these two models. We then tried three different methods to improve the performance of our IR system:

• Assigning different weights to different fields of the documents.
• Query expansion using synonyms extracted from WordNet.
• Re-ranking using the score obtained by performing sentiment analysis on the documents.

First, we followed each of these strategies separately to find the best parameters and weights for each of them. Then we combined all three techniques at the same time to see the effect with respect to the base implementation.

3.1. Pre-Processing

Our approach in creating Lucene Documents (https://lucene.apache.org/core/8_8_1/core/org/apache/lucene/document/Document.html) was to store different information in independent fields, in order to assign distinct weights to each field. In addition to the field that stores the identification number of the document and the field that stores its stance, we created three other fields for the premises, the conclusion and the body. The body field, in particular, contains both premises and conclusion, plus extra information about the document: acquisition time, source URL, topic, author, author role, author organization, source domain and discussion title. We decided not to keep the source text because we noticed that it contains too many useless terms, such as copyright information, navigation menus, site maps, etc.

We adopted the ClassicTokenizer (https://lucene.apache.org/core/8_8_1/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html) provided by Apache Lucene. This is a simple grammar-based tokenizer built with the lexical analyzer generator JFlex. It is designed to be a good tokenizer for most European-language documents: it splits words at punctuation characters, removing them; however, a dot that is not followed by whitespace is considered part of a token. It also splits words at hyphens, unless there is a number in the token, in which case the whole token is interpreted as a product number and is not split. It recognizes email addresses and internet hostnames as single tokens.

On top of this we applied the LowerCaseFilter (https://lucene.apache.org/core/8_8_1/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html), in order to normalize all tokens to lower case. This also allows query terms to match terms in the documents written, for example, in upper case. The next filter we used is the LengthFilter (https://lucene.apache.org/core/8_8_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html), which keeps only tokens with a length between 3 and 20 characters. This gave a significant improvement in the score, due to the exclusion of many non-informative words such as "I", "be", "me", "a", etc. The last filter we applied is a custom filter that collapses runs of three or more identical consecutive letters, keeping at most two. It is useful to remove typos and emphasized words: e.g. "helllo" and "yesssss" become respectively "hello" and "yess". A sketch of how this analysis chain can be assembled is shown below.
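As an illustration, the chain just described could be wired up as a custom Lucene Analyzer roughly as follows. This is only a sketch of the idea for Lucene 8.x, not the authors' actual code: the five-word stoplist is a placeholder for the EBSCOhost list discussed in Sec. 3.1.1, the filter order is assumed, and RepeatedLetterFilter is a hypothetical name for the custom run-collapsing filter.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.util.Arrays;

public class ArgumentAnalyzer extends Analyzer {

    // Placeholder stoplist: the paper uses the 24-word EBSCOhost list, not reproduced here.
    private static final CharArraySet STOPLIST = new CharArraySet(
            Arrays.asList("the", "of", "and", "a", "to"), true);

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new ClassicTokenizer();            // grammar-based tokenizer
        TokenStream stream = new LowerCaseFilter(source);     // normalize tokens to lower case
        stream = new LengthFilter(stream, 3, 20);             // keep tokens of 3-20 characters
        stream = new StopFilter(stream, STOPLIST);            // remove stopwords
        stream = new RepeatedLetterFilter(stream);            // collapse letter runs (see below)
        return new TokenStreamComponents(source, stream);
    }

    /** Collapses runs of three or more identical letters to two, e.g. "yesssss" -> "yess". */
    private static final class RepeatedLetterFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        RepeatedLetterFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            char[] buf = termAtt.buffer();
            int len = termAtt.length();
            StringBuilder out = new StringBuilder(len);
            int run = 0;
            for (int i = 0; i < len; i++) {
                run = (i > 0 && buf[i] == buf[i - 1]) ? run + 1 : 1;
                if (run <= 2) {                 // keep at most two consecutive equal letters
                    out.append(buf[i]);
                }
            }
            termAtt.setEmpty().append(out);
            return true;
        }
    }
}
```

An analyzer of this kind would then be passed both to the IndexWriterConfig at indexing time and to the query parser at search time, so that documents and queries are normalized in the same way.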
3.1.1. StopLists

Stopword filtering is a common pre-processing step because it removes many non-informative words. We realized that stoplists have a considerable impact on the nDCG@5 score, so we tried different lists, as reported in Tab. 1 and Tab. 2. The nDCG@5 was computed using only Lucene's LMDirichlet implementation, with no further pre-processing and without any of the techniques described later, in order to take into account only the stoplists. The maximum score was obtained with the EBSCOhost stoplist (https://connect.ebsco.com/s/article/What-are-the-stop-words-used-in-EBSCOhost-medical-databases-MEDLINE-and-CINAHL), a list of 24 words used in the EBSCOhost medical databases MEDLINE and CINAHL. In general, we observed that lists with more words tend to decrease the score; even with an empty stoplist the score was around the average. We then tried to create a stoplist with the 150 most frequent terms in the index (150_custom), which also achieved an average score. Finally, we extended EBSCOhost with the ten, twenty and thirty most frequent terms not yet present in the list. The score of the first two attempts was only slightly lower than that of the stock EBSCOhost list, and it decreased significantly when more words were added (e.g. EBSCOhost+30), confirming that in this situation a small stoplist is the best solution.

Stock stoplists | Number of words | nDCG@5
tent1 | 400 | 0.5599
Air3z4 | 1298 | 0.5757
zettair | 469 | 0.5790
smart | 571 | 0.5895
terrier | 733 | 0.5919
cook1988 | 221 | 0.6043
taporwave | 485 | 0.6068
postgre | 127 | 0.6078
nltk | 153 | 0.6078
lexisnexis | 100 | 0.6131
NO STOPLIST | 0 | 0.6189
corenlp | 28 | 0.6211
okapi | 108 | 0.6224
ranksnl | 32 | 0.6249
lucene_elastic | 33 | 0.6256
ovid | 39 | 0.6259
lingpipe | 76 | 0.6260
EBSCOhost | 24 | 0.6265
Table 1: nDCG@5 scores obtained with different stock stoplists.

Custom stoplists | Number of words | nDCG@5
150_custom | 150 | 0.6066
ebsco+10 | 34 | 0.6258
ebsco+20 | 44 | 0.6258
ebsco+30 | 54 | 0.6123
Table 2: nDCG@5 scores obtained with custom stoplists.

3.1.2. Stemmers

Stemming is the reduction of a word to its base form, called stem. We tried four different configurations, summarized in Tab. 3. The nDCG@5 was again computed using only Lucene's LMDirichlet implementation, without any of the techniques described later, in order to take into account only the stemmers. First of all, we did not use any form of stemming. We then tried three different stemmers included in the Lucene package. We started with the EnglishMinimalStemFilter (https://lucene.apache.org/core/8_8_1/analyzers-common/org/apache/lucene/analysis/en/EnglishMinimalStemFilter.html), which simply stems plural English words to their singular form. Second, we used the KStemFilter (https://lucene.apache.org/core/8_8_1/analyzers-common/org/apache/lucene/analysis/en/KStemFilter.html), which implements the Krovetz stemmer, a hybrid algorithmic-dictionary stemmer that produces actual words. Finally, we tried the most widely used stemmer in IR, the Porter stemmer, implemented in Lucene as PorterStemFilter (https://lucene.apache.org/core/8_8_1/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html), which works in steps, removing the longest possible suffix at each step until it reaches the base form. As already seen in Sec. 3.1.1, adding complexity to the system decreases the score, probably due to limitations of the stemmers used [14].

Stem Filter | nDCG@5
No Stem | 0.6265
English Minimal Stem | 0.6184
Krovetz Stem | 0.5747
Porter Stem | 0.5401
Table 3: nDCG@5 scores obtained using different stemmers.
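For reference, switching between these four configurations only requires appending (or omitting) one extra stage in the analysis chain sketched in Sec. 3.1. The helper below is a hypothetical illustration: the StemMode enum and the wiring are ours, not taken from the paper's code.

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishMinimalStemFilter;
import org.apache.lucene.analysis.en.KStemFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;

/** Appends one of the stemmers compared in Tab. 3 to an existing token stream. */
public final class Stemming {

    public enum StemMode { NONE, ENGLISH_MINIMAL, KROVETZ, PORTER }

    public static TokenStream apply(TokenStream stream, StemMode mode) {
        switch (mode) {
            case ENGLISH_MINIMAL: return new EnglishMinimalStemFilter(stream); // plural -> singular
            case KROVETZ:         return new KStemFilter(stream);              // hybrid algorithmic/dictionary
            case PORTER:          return new PorterStemFilter(stream);         // iterative suffix stripping
            default:              return stream;                               // "No Stem" baseline
        }
    }
}
```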
3.2. Different fields' weights

Since documents have more than one field to search in, at query time it is possible to assign a different weight to each field. In this way, a term found in a field with a higher weight also has a higher impact on the final score of the document. As explained in Sec. 3.1, we have three different fields containing respectively the body, the premises and the conclusion. We noticed that the premises are the most informative field, whereas the conclusions are often composed of a single term, which is very rarely relevant. According to these considerations, the best score should be obtained by assigning a higher weight to the premises and a lower one to the body and the conclusions. To choose the best values, we wrote a Python program that automatically computes, using trec_eval, the nDCG@5 for all combinations of weights (one per field) from 0 to 1 with a step of 0.25. The five best combinations of weights are listed in Tab. 4 for the BM25 similarity and in Tab. 5 for the LMDirichlet similarity. In both cases we pre-processed the documents using the best options obtained in Sec. 3.1.1 and Sec. 3.1.2: no stemmer and the EBSCOhost stoplist. Tab. 9 in Appendix A lists all the combinations we tried for both similarities. These results agree with the previous considerations and confirm our intuition.

Body | Premises | Conclusions | nDCG@5
0.0 | 1.0 | 0.25 | 0.4150
0.25 | 1.0 | 0.25 | 0.4143
0.5 | 1.0 | 0.25 | 0.4032
0.5 | 0.75 | 0.25 | 0.4029
0.25 | 0.75 | 0.25 | 0.4023
Table 4: nDCG@5 scores obtained with different fields' weights and BM25.

Body | Premises | Conclusions | nDCG@5
0.25 | 1 | 0 | 0.7379
0 | 1 | 0 | 0.7345
0.25 | 0.75 | 0 | 0.7331
0.5 | 1 | 0 | 0.7239
0.5 | 0.75 | 0 | 0.7123
Table 5: nDCG@5 scores obtained with different fields' weights and LMDirichlet.
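At query time, per-field weights such as those in Tab. 5 can be expressed in Lucene by boosting each field's clause. The snippet below sketches one way to do this with MultiFieldQueryParser; the field names and the choice of parser are our assumptions, not necessarily how the system was implemented.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.Query;

import java.util.HashMap;
import java.util.Map;

public final class WeightedFieldsQuery {

    // Field names are illustrative; the index described in Sec. 3.1 keeps body,
    // premises and conclusion in separate fields.
    public static Query build(String topic, Analyzer analyzer) throws ParseException {
        Map<String, Float> boosts = new HashMap<>();
        boosts.put("body", 0.25f);        // best LMDirichlet setting from Tab. 5
        boosts.put("premises", 1.0f);
        boosts.put("conclusion", 0.0f);

        MultiFieldQueryParser parser = new MultiFieldQueryParser(
                new String[] {"body", "premises", "conclusion"}, analyzer, boosts);
        return parser.parse(topic);       // one weighted clause per field
    }
}
```

The same effect can also be obtained by wrapping per-field term queries in BoostQuery instances inside a BooleanQuery.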
3.3. Query Expansion

Query expansion is a technique used to match more relevant documents by expanding or reformulating the basic search query. To improve the retrieval performance of our model, we integrated query expansion into our IR system by adding to each query all the synonyms of the terms that remain after the pre-processing phase. In particular, we decided to use WordNet, a lexical database of semantic relations between words. The SynonymMap object of Lucene's WordNet package (https://lucene.apache.org/core/8_8_1/api/contrib-wordnet/org/apache/lucene/wordnet/SynonymMap.html) allows the file downloaded from WordNet (https://wordnet.princeton.edu/download) to be loaded into a hash map that can be used for fast high-frequency lookups of synonyms. We decided to assign a dedicated weight to the synonyms added at query time, to give them more or less importance in the search. We tried different values, and the results are reported in Tab. 6. We pre-processed the documents using the best options obtained in Sec. 3.1.1 and Sec. 3.1.2: no stemmer and the EBSCOhost stoplist. As can be noticed, with the BM25 similarity and low synonym weights (the best being 0.2) there is an increase in the evaluated score. On the contrary, with the LMDirichlet similarity adding synonyms brings no improvement. This is probably caused by an increase in noise, which produces matches with non-relevant documents and decreases the final score.

Synonyms Weight | nDCG@5 BM25 | nDCG@5 LMDirichlet
No synonyms | 0.3938 | 0.7345
0.1 | 0.4113 | 0.6986
0.2 | 0.4159 | 0.6483
0.3 | 0.3973 | 0.5913
0.4 | 0.3898 | 0.5267
0.5 | 0.3764 | 0.4731
0.6 | 0.3596 | 0.4273
0.7 | 0.3304 | 0.3847
0.8 | 0.2931 | 0.3406
0.9 | 0.2584 | 0.2892
1.0 | 0.2253 | 0.2564
Table 6: nDCG@5 scores obtained with different weights for the synonyms in query expansion.
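A sketch of how such a weighted expansion could be expressed as a Lucene query is shown below. It is illustrative only: lookupSynonyms() is a hypothetical placeholder for the WordNet lookup (in the system the WordNet export is loaded into a SynonymMap), and the field name and clause semantics are our assumptions.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

import java.util.Collections;
import java.util.List;

public final class SynonymExpansion {

    /**
     * Expands the pre-processed query terms with their synonyms, down-weighted by
     * synWeight (e.g. 0.2, the best BM25 value in Tab. 6).
     */
    public static Query expand(String field, List<String> queryTerms, double synWeight) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String term : queryTerms) {
            // Original term at full weight.
            builder.add(new TermQuery(new Term(field, term)), BooleanClause.Occur.SHOULD);
            // Each synonym as an additional, down-weighted optional clause.
            for (String synonym : lookupSynonyms(term)) {
                Query synQuery = new TermQuery(new Term(field, synonym));
                builder.add(new BoostQuery(synQuery, (float) synWeight), BooleanClause.Occur.SHOULD);
            }
        }
        return builder.build();
    }

    // Hypothetical hook: would be backed by the WordNet synonym map in practice.
    private static List<String> lookupSynonyms(String term) {
        return Collections.emptyList();
    }
}
```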
3.4. Re-ranking

In the last step, we re-ranked the top 30 documents retrieved in the previous phase by performing sentiment analysis on the arguments. To perform the analysis we used the VADER tool [15], and in particular the Java port provided by Animesh Pandey on GitHub (https://github.com/apanimesh061/VaderSentimentJava). This tool computes a value between -1 and 1 for each argument: values greater than 0 represent a positive sentiment of the author, values lower than 0 indicate negativity, and values close to 0 express a neutral sentiment. We tried two different re-ranking approaches:

1. Promote emotional documents, combining the score from the previous phase with the sentiment analysis score, using Eq. 1:
$Score_{new} = \frac{1}{3} \cdot Score + \frac{2}{3} \cdot |Sentiment| \cdot Score$    (1)

2. Promote neutral documents instead of emotional ones, using Eq. 2:
$Score_{new} = \frac{1}{3} \cdot Score - \frac{2}{3} \cdot |Sentiment| \cdot Score$    (2)

We decided to give more importance to the sentiment term, using the higher coefficient in Eq. 1 and Eq. 2, since with lower values we could not observe any improvement. We re-ranked both with the sentiment score computed on the premises and with the one computed on the conclusions, to see which strategy is the right one. The results are provided in Tab. 7. We pre-processed the documents using the best options obtained in Sec. 3.1.1 and Sec. 3.1.2: no stemmer and the EBSCOhost stoplist.

| Sentiment on premises (BM25) | Sentiment on premises (LMDirichlet) | Sentiment on conclusions (BM25) | Sentiment on conclusions (LMDirichlet)
No sentiment | 0.3938 | 0.7345 | 0.3938 | 0.7345
Neutral is better | 0.0811 | 0.0569 | 0.0811 | 0.0569
Emotional is better | 0.4362 | 0.6952 | 0.1423 | 0.1414
Table 7: nDCG@5 scores obtained with BM25 and LMDirichlet similarities and different configurations of sentiment analysis.

As we can see, with the sentiment scores computed on the conclusions the scores decrease drastically with both approaches and with both models. This is probably due to the fact that conclusions are often composed of few words (sometimes only one), and so the sentiment score does not faithfully express the true sentiment of the author. Using the sentiment scores computed on the premises, the scores drop almost to zero when we give more importance to neutral documents, while we see a small improvement of the score (BM25) or almost the same value (LMDirichlet) when we promote emotional documents. Hence, a higher absolute value of the sentiment score seems to indicate better argumentation, and thus a generally higher relevance of the retrieved documents.
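As an illustration of the first approach, re-ranking the top documents with Eq. 1 could look like the sketch below. This is not the authors' code: sentimentOf is a hypothetical hook standing in for the VADER compound score computed on the premises.

```java
import org.apache.lucene.search.ScoreDoc;

import java.util.Arrays;

public final class SentimentReRanker {

    /**
     * Re-scores the top documents with Eq. 1 (emotional arguments promoted):
     * newScore = (1/3) * score + (2/3) * |sentiment| * score.
     */
    public static ScoreDoc[] rerank(ScoreDoc[] top, SentimentFunction sentimentOf) {
        ScoreDoc[] reranked = new ScoreDoc[top.length];
        for (int i = 0; i < top.length; i++) {
            double sentiment = sentimentOf.compute(top[i].doc);   // value in [-1, 1]
            float newScore = (float) (top[i].score / 3.0
                    + 2.0 / 3.0 * Math.abs(sentiment) * top[i].score);
            reranked[i] = new ScoreDoc(top[i].doc, newScore);
        }
        Arrays.sort(reranked, (a, b) -> Float.compare(b.score, a.score)); // highest score first
        return reranked;
    }

    /** Hypothetical hook: maps a Lucene doc id to its sentiment score. */
    public interface SentimentFunction {
        double compute(int docId);
    }
}
```

The second approach (Eq. 2) only changes the sign of the sentiment term.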
4. Experimental Setup

Touché Task 1 offers the possibility to access the args.me corpus either via the API of the args.me search engine or by downloading a file containing all the documents. We decided to download the entire corpus, in particular the version updated to 2020-04-01. For Touché Task 1 we also used the TIRA platform [16] to submit and evaluate our model; a working implementation of our approach is therefore available in TIRA.

4.1. Data Description

The updated version of the args.me corpus contains 387,740 arguments crawled from four debate portals (debatewise.org, idebate.org, debatepedia.org, and debate.org), plus 48 arguments from Canadian parliament discussions. The arguments were extracted using heuristics designed for each debate portal. Each argument is identified by an ID and consists of a conclusion and one or more premises. Each document also contains some contextual information, such as the source URL, the title of the discussion and several others. As for the topics, the Touché Lab provided us with 50 controversial topics (the queries potentially issued by a user); each topic has both pro and con relevant arguments in the document collection.

4.2. Evaluation measures

We used the Normalized Discounted Cumulated Gain (nDCG) [17] with an evaluation depth of 5, since this is the evaluation measure used by the Touché Lab to evaluate the runs. In particular, we used the implementation provided by the trec_eval library (https://trec.nist.gov/trec_eval/) to measure the performance of our IR system. The Discounted Cumulated Gain (DCG) is computed as in Eq. 3. The parameter b models the patience of the user in scanning the result list: typical values are 2 for an impatient user and 10 for a patient one; trec_eval uses b = 2. Since the result is not bounded in [0, 1], it is necessary to normalize the score by dividing the DCG by the Ideal Discounted Cumulated Gain (iDCG), provided by the Touché Lab, as shown in Eq. 4. The iDCG is obtained by sorting all relevant documents in the corpus by their relevance, producing the maximum possible DCG up to position 5.

$DCG@5 = \sum_{n=1}^{5} \frac{relevance_n}{\max(1, \log_b(n+1))}$    (3)

$nDCG@5 = \frac{DCG@5}{iDCG@5}$    (4)
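To make the measure concrete, here is a small worked example with hypothetical relevance grades (not taken from the actual qrels), using b = 2 as in trec_eval: the ranked grades are (2, 1, 0, 1, 2) and the ideal ordering is (2, 2, 1, 1, 0).

```latex
\[
DCG@5 = \frac{2}{1} + \frac{1}{\log_2 3} + \frac{0}{\log_2 4} + \frac{1}{\log_2 5} + \frac{2}{\log_2 6}
      \approx 3.835
\]
\[
iDCG@5 = \frac{2}{1} + \frac{2}{\log_2 3} + \frac{1}{\log_2 4} + \frac{1}{\log_2 5} + \frac{0}{\log_2 6}
       \approx 4.193,
\qquad
nDCG@5 = \frac{3.835}{4.193} \approx 0.91
\]
```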
5. Results and Discussion

5.1. Results

As previously mentioned, we first tried all the different techniques separately to choose the best parameters and weights for each method, then we merged them, using the three strategies at the same time. In all cases we pre-processed the documents using the best options obtained in Sec. 3.1.1 and Sec. 3.1.2: no stemmer and the EBSCOhost stoplist. In Tab. 8 we report the best score achieved for each method; the last line shows the score obtained by using all the techniques together.

| nDCG@5 BM25 | nDCG@5 LMDirichlet
Base | 0.3938 | 0.7345
Best different fields' weights | 0.4698 | 0.8026
Best query expansion with synonyms | 0.4159 | 0.6986
Best re-ranking with sentiment analysis | 0.4362 | 0.6952
Merging all three strategies | 0.4521 | 0.6661
Table 8: nDCG@5 final scores obtained with the different presented strategies.

Looking at Tab. 8, we can notice that with the BM25 similarity all three methods worked well, increasing the score of the base case, although the score achieved by combining the three techniques is not the best one. For the LMDirichlet model, the final score is lower than the base one, and the best run is the one that uses neither query expansion nor sentiment analysis. Hence, these two techniques did not work well with the LMDirichlet model, and they lead to lower performance when all the techniques are merged. With LMDirichlet, the score achieved with query expansion is clearly lower than the base one (and, at the higher synonym weights in Tab. 6, it drops to roughly half of it). One possible explanation of this phenomenon is that there are too many synonyms for each word, which introduces noise that degrades the performance of the search; indeed, according to previous studies [18], there is no way, using only WordNet, to select an appropriate subset of synonyms. The sentiment analysis leads to a very small improvement for BM25, while for LMDirichlet the score is almost the same. Looking manually at the documents retrieved after the first phase, we discovered that almost all the documents in the top positions already have a high sentiment value. Accordingly, the re-ranking probably does not work very well because the documents with a higher sentiment value are already ranked as the most relevant ones; hence it is not possible to improve the nDCG@5 score, because the top-ranked documents are also the ones with the highest sentiment score. Finally, we can state that the LMDirichlet model is better than BM25 for argument retrieval, confirming the results obtained by the teams of the previous edition of the Touché Argument Retrieval Lab [4].

6. Conclusions and Future Work

We implemented an IR system to retrieve the most relevant arguments for the queries provided in the Touché shared task. We used both the BM25 and LMDirichlet models, and we showed that LMDirichlet performs much better for argument retrieval. We also showed how important it is to give the right weight to the different parts of a document, since a lot of information can be useless during the search. Nevertheless, some aspects could be improved to reach better performance. For example, instead of expanding the queries by simply adding all the synonyms of a specific word, it would be better to associate a score with each synonym indicating how similar the two words are, and then use it to weight the different synonyms when performing the query. Another improvement could be a better formula to re-rank the documents, or the use of a different score instead of the one obtained with sentiment analysis. For example, with a machine learning approach it would be possible to train a model to assign a quality score to each argument and then use this value to re-rank the top retrieved documents. To conclude, we presented our approach to the problem of argument retrieval, and we believe that ever better solutions will be presented in the future, especially with the help of machine learning.

References

[1] C. Lumer, Walton's argumentation schemes, OSSA Conference Archive (2016).
[2] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument Retrieval, in: Working Notes Papers of the CLEF 2021 Evaluation Labs, CEUR Workshop Proceedings, 2021.
[3] H. Wachsmuth, N. Naderi, Y. Hou, Y. Bilu, V. Prabhakaran, T. A. Thijm, G. Hirst, B. Stein, Computational argumentation quality assessment in natural language, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 2017, pp. 176–187.
[4] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, et al., Overview of Touché 2020: Argument Retrieval, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2020, pp. 384–395.
[5] H. Wachsmuth, M. Potthast, K. Al-Khatib, Y. Ajjour, J. Puschmann, J. Qu, J. Dorsch, V. Morari, J. Bevendorff, B. Stein, Building an argument search engine for the web, in: Proceedings of the 4th Workshop on Argument Mining, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 49–59. URL: https://www.aclweb.org/anthology/W17-5106. doi:10.18653/v1/W17-5106.
[6] M. Potthast, L. Gienapp, F. Euchner, N. Heilenkötter, N. Weidmann, H. Wachsmuth, B. Stein, M. Hagen, Argument search: Assessing argument relevance, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1117–1120. URL: https://doi.org/10.1145/3331184.3331327. doi:10.1145/3331184.3331327.
[7] C. Staudte, L. Lange, SentArg: A hybrid doc2vec/DPH model with sentiment analysis refinement, Working Notes Papers of the CLEF 2020 Evaluation Labs (2020).
[8] C. Akiki, M. Potthast, Exploring argument retrieval with transformers, Working Notes Papers of the CLEF (2020).
[9] A. Radford, K. Narasimhan, Improving language understanding by generative pre-training, 2018.
[10] M. Bundesmann, L. Christ, M. Richter, Creating an argument search engine for online debates (2020).
[11] M. S. Shahshahani, J. Kamps, University of Amsterdam at CLEF 2020 (2020).
[12] L. Dumani, R. Schenkel, Ranking arguments by combining claim similarity and argument quality dimensions, CEUR Workshop Proceedings 2696 (2020). URL: http://ceur-ws.org/Vol-2696/paper_174.pdf.
[13] H. Wachsmuth, N. Naderi, Y. Hou, Y. Bilu, V. Prabhakaran, T. A. Thijm, G. Hirst, B. Stein, Computational argumentation quality assessment in natural language, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Association for Computational Linguistics, Valencia, Spain, 2017, pp. 176–187. URL: https://www.aclweb.org/anthology/E17-1017.
[14] A. Jivani, A comparative study of stemming algorithms, Int. J. Comp. Tech. Appl. 2 (2011) 1930–1938.
[15] C. Hutto, E. Gilbert, VADER: A parsimonious rule-based model for sentiment analysis of social media text, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 8, 2014.
[16] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.
[17] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems (TOIS) 20 (2002) 422–446.
[18] D. Parapar, A. Barreiro, D. E. Losada, Query expansion using WordNet with a logical model of information retrieval, IADIS AC 2005 (2005) 487–494.
7. Appendix A: All combinations of fields' weights with BM25 and LMDirichlet

Body | Premises | Conclusions | nDCG@5 BM25 | nDCG@5 LMDirichlet
0 | 0 | 0 | 0.0024 | 0.0024
0 | 0 | 0.25 | 0.153 | 0.1618
0 | 0 | 0.5 | 0.153 | 0.1618
0 | 0 | 0.75 | 0.153 | 0.1618
0 | 0 | 1 | 0.153 | 0.1618
0 | 0.25 | 0 | 0.3938 | 0.7345
0 | 0.25 | 0.25 | 0.3191 | 0.6147
0 | 0.25 | 0.5 | 0.2954 | 0.5228
0 | 0.25 | 0.75 | 0.2835 | 0.4491
0 | 0.25 | 1 | 0.2631 | 0.414
0 | 0.5 | 0 | 0.3938 | 0.7345
0 | 0.5 | 0.25 | 0.3827 | 0.6606
0 | 0.5 | 0.5 | 0.3191 | 0.6147
0 | 0.5 | 0.75 | 0.3035 | 0.5709
0 | 0.5 | 1 | 0.2954 | 0.5228
0 | 0.75 | 0 | 0.3938 | 0.7345
0 | 0.75 | 0.25 | 0.3996 | 0.6829
0 | 0.75 | 0.5 | 0.3516 | 0.6524
0 | 0.75 | 0.75 | 0.3191 | 0.6147
0 | 0.75 | 1 | 0.3061 | 0.5947
0 | 1 | 0 | 0.3938 | 0.7345
0 | 1 | 0.25 | 0.415 | 0.6849
0 | 1 | 0.5 | 0.3827 | 0.6606
0 | 1 | 0.75 | 0.3438 | 0.6455
0 | 1 | 1 | 0.3191 | 0.6147
0.25 | 0 | 0 | 0.3309 | 0.6513
0.25 | 0 | 0.25 | 0.2562 | 0.5294
0.25 | 0 | 0.5 | 0.2397 | 0.4395
0.25 | 0 | 0.75 | 0.2326 | 0.4001
0.25 | 0 | 1 | 0.233 | 0.3596
0.25 | 0.25 | 0 | 0.3875 | 0.7095
0.25 | 0.25 | 0.25 | 0.351 | 0.6345
0.25 | 0.25 | 0.5 | 0.3036 | 0.585
0.25 | 0.25 | 0.75 | 0.2902 | 0.5395
0.25 | 0.25 | 1 | 0.2757 | 0.4881
0.25 | 0.5 | 0 | 0.3817 | 0.7239
0.25 | 0.5 | 0.25 | 0.3773 | 0.6605
0.25 | 0.5 | 0.5 | 0.3362 | 0.6278
0.25 | 0.5 | 0.75 | 0.3094 | 0.5938
0.25 | 0.5 | 1 | 0.2992 | 0.5648
0.25 | 0.75 | 0 | 0.3955 | 0.7331
0.25 | 0.75 | 0.25 | 0.4023 | 0.685
0.25 | 0.75 | 0.5 | 0.3672 | 0.6445
0.25 | 0.75 | 0.75 | 0.3363 | 0.6269
0.25 | 0.75 | 1 | 0.313 | 0.6002
0.25 | 1 | 0 | 0.3959 | 0.7379
0.25 | 1 | 0.25 | 0.4143 | 0.6903
0.25 | 1 | 0.5 | 0.3741 | 0.6603
0.25 | 1 | 0.75 | 0.3524 | 0.6411
0.25 | 1 | 1 | 0.3308 | 0.6271
0.5 | 0 | 0 | 0.3309 | 0.6513
0.5 | 0 | 0.25 | 0.2793 | 0.5823
0.5 | 0 | 0.5 | 0.2562 | 0.5294
0.5 | 0 | 0.75 | 0.2444 | 0.4877
0.5 | 0 | 1 | 0.2397 | 0.4395
0.5 | 0.25 | 0 | 0.3878 | 0.6962
0.5 | 0.25 | 0.25 | 0.3548 | 0.6423
0.5 | 0.25 | 0.5 | 0.3228 | 0.6058
0.5 | 0.25 | 0.75 | 0.2969 | 0.5631
0.5 | 0.25 | 1 | 0.2853 | 0.5292
0.5 | 0.5 | 0 | 0.3875 | 0.7095
0.5 | 0.5 | 0.25 | 0.3841 | 0.6624
0.5 | 0.5 | 0.5 | 0.351 | 0.6345
0.5 | 0.5 | 0.75 | 0.3266 | 0.6094
0.5 | 0.5 | 1 | 0.3036 | 0.585
0.5 | 0.75 | 0 | 0.3827 | 0.7123
0.5 | 0.75 | 0.25 | 0.4029 | 0.6698
0.5 | 0.75 | 0.5 | 0.3654 | 0.6462
0.5 | 0.75 | 0.75 | 0.34 | 0.6314
0.5 | 0.75 | 1 | 0.3239 | 0.608
0.5 | 1 | 0 | 0.3817 | 0.7239
0.5 | 1 | 0.25 | 0.4032 | 0.6896
0.5 | 1 | 0.5 | 0.3773 | 0.6605
0.5 | 1 | 0.75 | 0.3658 | 0.6384
0.5 | 1 | 1 | 0.3362 | 0.6278
0.75 | 0 | 0 | 0.3309 | 0.6513
0.75 | 0 | 0.25 | 0.2881 | 0.6014
0.75 | 0 | 0.5 | 0.2776 | 0.5608
0.75 | 0 | 0.75 | 0.2562 | 0.5294
0.75 | 0 | 1 | 0.2467 | 0.5082
0.75 | 0.25 | 0 | 0.3783 | 0.6887
0.75 | 0.25 | 0.25 | 0.3569 | 0.6451
0.75 | 0.25 | 0.5 | 0.3355 | 0.6095
0.75 | 0.25 | 0.75 | 0.3121 | 0.5785
0.75 | 0.25 | 1 | 0.2876 | 0.5584
0.75 | 0.5 | 0 | 0.3874 | 0.6989
0.75 | 0.5 | 0.25 | 0.384 | 0.6645
0.75 | 0.5 | 0.5 | 0.359 | 0.6273
0.75 | 0.5 | 0.75 | 0.3335 | 0.6187
0.75 | 0.5 | 1 | 0.3122 | 0.5941
0.75 | 0.75 | 0 | 0.3875 | 0.7095
0.75 | 0.75 | 0.25 | 0.3999 | 0.6766
0.75 | 0.75 | 0.5 | 0.3685 | 0.6531
0.75 | 0.75 | 0.75 | 0.351 | 0.6345
0.75 | 0.75 | 1 | 0.3302 | 0.6186
0.75 | 1 | 0 | 0.3833 | 0.7108
0.75 | 1 | 0.25 | 0.4022 | 0.6913
0.75 | 1 | 0.5 | 0.3796 | 0.6622
0.75 | 1 | 0.75 | 0.3698 | 0.637
0.75 | 1 | 1 | 0.3461 | 0.6316
1 | 0 | 0 | 0.3309 | 0.6513
1 | 0 | 0.25 | 0.2982 | 0.6131
1 | 0 | 0.5 | 0.2793 | 0.5823
1 | 0 | 0.75 | 0.2689 | 0.5492
1 | 0 | 1 | 0.2562 | 0.5294
1 | 0.25 | 0 | 0.3758 | 0.6804
1 | 0.25 | 0.25 | 0.3531 | 0.6529
1 | 0.25 | 0.5 | 0.3425 | 0.6171
1 | 0.25 | 0.75 | 0.312 | 0.6001
1 | 0.25 | 1 | 0.3046 | 0.5734
1 | 0.5 | 0 | 0.3878 | 0.6962
1 | 0.5 | 0.25 | 0.3812 | 0.6661
1 | 0.5 | 0.5 | 0.3548 | 0.6423
1 | 0.5 | 0.75 | 0.3465 | 0.6204
1 | 0.5 | 1 | 0.3228 | 0.6058
1 | 0.75 | 0 | 0.3894 | 0.7093
1 | 0.75 | 0.25 | 0.3986 | 0.673
1 | 0.75 | 0.5 | 0.3656 | 0.6548
1 | 0.75 | 0.75 | 0.3625 | 0.6281
1 | 0.75 | 1 | 0.3377 | 0.619
1 | 1 | 0 | 0.3875 | 0.7095
1 | 1 | 0.25 | 0.3995 | 0.687
1 | 1 | 0.5 | 0.3841 | 0.6624
1 | 1 | 0.75 | 0.3657 | 0.644
1 | 1 | 1 | 0.351 | 0.6345
Table 9: nDCG@5 scores obtained with all combinations of fields' weights, both for BM25 and LMDirichlet.