SEUPD@CLEF: Team hextech on Argument Retrieval for Comparative Questions. The importance of adjectives in document quality evaluation
Notebook for the Touché Lab on Argument Retrieval at CLEF 2022

Alessandro Chimetto, Davide Peressoni, Enrico Sabbatini, Giovanni Tommasin, Marco Varotto, Alessio Zanardelli and Nicola Ferro
University of Padua, Italy

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
alessandro.chimetto.1@studenti.unipd.it (A. Chimetto); davide.peressoni@studenti.unipd.it (D. Peressoni); enrico.sabbatini@studenti.unipd.it (E. Sabbatini); giovanni.tommasin@studenti.unipd.it (G. Tommasin); marco.varotto.3@studenti.unipd.it (M. Varotto); alessio.zanardelli@studenti.unipd.it (A. Zanardelli); ferro@dei.unipd.it (N. Ferro)
http://www.dei.unipd.it/~ferro/ (N. Ferro); ORCID 0000-0001-9219-6239 (N. Ferro)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Abstract
This report explains our approach to the Task 2 challenge on Argument Retrieval for Comparative Questions, proposed by the third Touché lab on argument retrieval at CLEF 2022. Given a comparative topic, the task is to retrieve, from a collection of documents, argumentative passages that are relevant to either of the compared objects or to both. Our approach follows the standard Information Retrieval pipeline: parsing of the document collection, indexing of the parsed documents with an analyzer built from common filters, and query matching with a retrieval model. In addition, we implemented an index field that captures the overall quality of each passage for our specific task. From an analysis of our results, the introduction of the quality field slightly improves the ranking of the retrieved documents.

Keywords
Argument Retrieval, Lucene, Comparative questions, CLEF 2022 Task 2

1. Introduction
The Touché Lab is organized within CLEF, a large-scale evaluation initiative for Information Retrieval (IR) in the European community. As a team from the Search Engines course of the master degree in Computer Engineering at the University of Padua, we participated, under the team name hextech, in the 2022 Touché challenge Task 2 [1, 2], which focuses on comparative questions. The goal of this task is to retrieve, from a document collection, passages that answer the comparative questions stored in the topics file. The corpus we used was provided by the Touché Lab organizers and is composed of about 0.9 million text passages taken from the ClueWeb12 [3] collection. Our IR system is based on various types of word frequencies that capture the quality of the English phrase syntax. As a consequence, we introduced an attribute describing the overall syntax quality of each document.
The types of word frequencies we used are:

• Symbol frequency: the frequency of symbols with respect to the total number of words in the passage;
• Word length frequency: the difference between the frequency of short words and the frequency of long words, with respect to the total number of words in the passage;
• Comparative adjective frequency: the frequency of comparative adjectives with respect to the total number of words in the passage;
• Adjective frequency: the frequency of all adjectives, both comparative and descriptive, with respect to the total number of words in the passage.

The main purpose of introducing these frequencies during the indexing phase is to re-rank the retrieved documents. According to the computed frequencies, a passage should be placed lower in the ranking if it contains plenty of symbols or if its word lengths do not follow the least effort principle. Conversely, a passage should appear at the top ranks if it contains many comparative and descriptive adjectives.

The paper is organized as follows:
• Section 2 reports related work;
• Section 3 describes our methodology and the workflow we followed to solve the task;
• Section 4 explains our experimental setup;
• Section 5 discusses our experimental results;
• Section 6 presents a statistical analysis of the runs;
• Section 7 reports the conclusions and future work.

2. Related Work
We started our project from the basic retrieval system in the Search Engines course repository, which contains several project examples useful to learn about Information Retrieval. The project skeleton is based on the Hello-IR example, which provides the basics of the Lucene library, e.g. the standard analyzer and a query searcher. Furthermore, we improved our analyzer by taking suggestions from the Hello-Analyzer example; in particular, we adopted the POS recognizer and the Lovins stemmer filter [4]. Finally, we used the BM25 similarity score, as suggested by the literature [5, 6]. To improve the basic retrieval system, we applied query expansion by adding synonyms to the query [7, 8, 9]. Moreover, we took inspiration from [9] and tried to capture the importance of the different types of adjectives in each document.

3. Methodology
The methodology we adopted to build our IR system is based on basic English sentence analysis. We started by making two assumptions: the first one is related to the correct syntax and correct structure of an English sentence [10]; the second one is strictly related to the Touché task we are facing, namely argument retrieval for comparative questions. According to these assumptions, a document acquires importance if it contains an appropriate number of adjectives, in particular comparative ones: to properly describe or compare different subjects, a sentence has to contain one or more adjectives. Moreover, a document is not informative if it contains a high number of symbols with respect to the number of words. Instead, a document is likely to be informative and readable if it follows Zipf's least effort principle [11, 12], which states that short words are more common than long ones. From the second assumption, since adjectives are important for this task, we removed all of them from the stoplist, in order not to lose significant and informative tokens. Moreover, we added to the query the synonyms of each adjective and other possible keywords.

Figure 1: A simple view of the offline phase of the system architecture (Parser: parse the documents; Analyzer: filter the tokens and compute the frequencies; Indexer: save body and quality in the index).

For the remaining part of the IR system we adopted the usual pipeline offered by the Lucene framework, as can be seen in Figure 1. We now describe this architecture in detail.

3.1. System architecture
We used the Java programming language and the Lucene library [13] to implement and develop our IR system. The Parser, the Indexer, and the Searcher were built upon examples seen during the lectures. The system architecture can be divided into the following stages.

Parser To parse the topics file we used the org.w3c.dom package [14] of the Java standard library, which provides a parser for the XML file in which the list of topics is stored. To parse the passages file we used the java.util.zip package [15] of the Java standard library to decompress the gzipped file on the fly; it is chained to the JSON parsing facilities offered by the Gson library [16].
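As an illustration of this stage, the sketch below streams the compressed passages file and converts each line into a small Java object with Gson. It assumes a gzipped JSON Lines layout and uses illustrative field names (id, contents, chatNoirUrl) that may not match the exact names used in the corpus; the topics file is handled analogously with a javax.xml.parsers.DocumentBuilder and is omitted here.

```java
import com.google.gson.Gson;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class PassageParser {

    // Simplified view of a passage; the real JSON field names in the corpus may differ.
    static class Passage {
        String id;
        String contents;
        String chatNoirUrl;
    }

    public static void main(String[] args) throws Exception {
        Gson gson = new Gson();
        // Decompress the gzipped corpus on the fly and parse one JSON object per line.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0])), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isBlank()) continue;
                Passage p = gson.fromJson(line, Passage.class);
                // ... hand the passage over to the Analyzer and the Indexer ...
                System.out.println(p.id);
            }
        }
    }
}
```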
3.1.1. Analyzer
The source of our Analyzer is the standard tokenizer. The returned token stream is then mapped by applying filters in the following order:

1. Stop filter: it removes the most common English words, which are not informative [17]. This filter works in a case-insensitive mode, so that capitalized and case-mistyped words are removed as well. Moreover, it does not remove adjectives, as stated in Section 3.
2. Lowercase or brand filter: as far as letter case is concerned, we convert all tokens to lowercase in order to match the same token in all its possible writing variants. Normally, brand and product names can be converted to lowercase without losing information. However, there is an exception for words that refer to famous brands [18] and at the same time to common English words [19] (e.g. Apple the brand and apple the fruit): in this case we do not modify the token. Preserving brand-name information is important for this task, since many queries involve comparisons between brands or products.
3. Lowercase copy filter: it complements the previous filter by taking care of initial capitals and possible writing errors, which could be confused with product names. To handle these situations, we duplicate each non-lowercase token (which, thanks to the previous filter, may refer either to a word or to a name): one copy remains as the original, the other is converted to lowercase. In this way we obtain a token for the name and a token for the word. The last two filters were written as separate components to allow the later insertion of a grammatical analysis filter, which would be needed because the token duplication alters the structure of the sentences.
4. Lovins stem filter: a well-known stemming filter, designed by Julie B. Lovins, which strips words of their suffixes [4].

A minimal sketch of this filter chain is shown below.
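The sketch assembles such a chain with the Lucene Analyzer API (Lucene 9 package names). It is an approximation: the stoplist is reduced to a handful of words, and the two custom brand-aware filters of steps 2 and 3 are only hinted at by a comment, with a plain lowercase filter as placeholder; class and variable names are ours for illustration.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

import java.util.List;

public class QualityAwareAnalyzer extends Analyzer {

    // Stoplist without adjectives; ignoreCase = true makes the stop filter case-insensitive.
    private final CharArraySet stopWords =
            new CharArraySet(List.of("the", "of", "and", "a", "in", "to"), true);

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        // 1. Stop filter (case-insensitive, adjectives kept).
        TokenStream stream = new StopFilter(source, stopWords);
        // 2.-3. The custom "lowercase or brand" and "lowercase copy" filters would be
        //        chained here; as a placeholder we simply lowercase every token.
        stream = new LowerCaseFilter(stream);
        // 4. Lovins stemmer, provided by Lucene's Snowball module.
        stream = new SnowballFilter(stream, "Lovins");
        return new TokenStreamComponents(source, stream);
    }
}
```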
3.1.2. Indexer
We used the standard Lucene Indexer with the BM25 similarity [5, 6]. The Indexer starts by creating an inverted index, so called because it inverts a page-centric data structure (page -> words) into a keyword-centric one (word -> pages). It then populates the inverted index by analyzing the passages one by one with an analyzer; in practice we extended the basic analyzer, as described in Section 3.1.1. During the indexing phase the quality score of each document is computed (as mentioned in Section 1); during the search phase this quality score is multiplied by the query score, with the objective of re-ranking the retrieved documents. The aim is to penalize badly written documents and to promote comparative passages over purely descriptive ones. The quality is internally represented by a convex combination of the following frequencies:

Symbol frequency A document with plenty of non-informative symbols (e.g. #, emoji, . . . ) is penalized, because it is likely a badly written passage and usually corresponds to click-bait, scam or promotional pages. Some symbols bring no penalty as long as they appear in an appropriate quantity (e.g. !, ?, . . . ): for these symbols the penalty assigned to the n-th occurrence is 1 − 1/n, instead of the fixed penalty of 1 assigned to every occurrence of the other symbols. All characters outside the ASCII ranges from ',' to ';' (which includes the digits and some common punctuation), from 'A' to 'Z', from 'a' to 'z', plus the space character, are considered symbols. The symbols with the increasing penalty 1 − 1/n are: ?, %, $, &, *, +, /, <, =, >, @, _, ", ', (, ), [, ].

Word length frequency To check that the document follows Zipf's least effort principle, we compute the difference between the frequency of short words and the frequency of long words, where short tokens are words of length less than or equal to four. Finally, we rescale this difference between 0 and 1.

Adjective frequency To be descriptive or comparative, a document must contain adjectives. To capture this property we compute the frequency of adjectives with respect to the total number of words. Intuitively, the higher the frequency, the more descriptive the document.

Comparative adjective frequency Following the same reasoning, the higher this value, the more comparative the document. We compute it as the ratio between the number of comparative adjectives and the total number of adjectives.

The last two contributions have lighter weights in the convex combination: these frequencies distinguish between two types of good documents, preferring comparative ones. To classify the adjectives we prepared two lists, the first containing comparative adjectives and the second containing descriptive ones; the adjectives were taken from an online dictionary (https://www.dictionary.com). Finally, we add a bias to the convex combination as a form of smoothing, to avoid bringing final scores down to zero. A sketch of this computation is given below.
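To make the combination concrete, the following sketch shows one possible way of turning the text and token stream of a passage into a quality score. The weights, the bias, the rescaling of each component into [0, 1] and the helper names (TOLERATED, the adjective sets) are illustrative choices of this sketch, not the exact values and formulas of our runs (the tested weights are listed in Table 1).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class QualityScorer {

    // Illustrative weights and bias; see Table 1 for the values actually tested.
    private static final double W_SYMBOLS = 0.4, W_LENGTH = 0.3, W_ADJ = 0.1, W_COMP = 0.1, BIAS = 0.1;

    // Symbols tolerated in moderate quantity: the n-th occurrence costs 1 - 1/n.
    private static final Set<Character> TOLERATED =
            Set.of('?', '%', '$', '&', '*', '+', '/', '<', '=', '>', '@', '_', '"', '\'', '(', ')', '[', ']');

    /** Computes a quality score in (0, 1] from the raw text and the tokenized passage. */
    public static double quality(String text, List<String> tokens,
                                 Set<String> adjectives, Set<String> comparatives) {
        int words = Math.max(1, tokens.size());

        // 1) Symbol component: accumulate penalties, then map them to a score in [0, 1].
        double penalty = 0;
        Map<Character, Integer> seen = new HashMap<>();
        for (char c : text.toCharArray()) {
            boolean allowed = (c >= ',' && c <= ';') || (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z') || c == ' ';
            if (allowed) continue;
            if (TOLERATED.contains(c)) {
                int n = seen.merge(c, 1, Integer::sum);
                penalty += 1.0 - 1.0 / n;   // first occurrence is free, later ones cost more
            } else {
                penalty += 1.0;             // fixed penalty for any other symbol
            }
        }
        double symbolScore = 1.0 - Math.min(1.0, penalty / words);

        // 2) Word-length component: short (<= 4 chars) minus long word frequency, rescaled to [0, 1].
        long shortWords = tokens.stream().filter(t -> t.length() <= 4).count();
        double lengthScore = ((double) (2L * shortWords - words) / words + 1.0) / 2.0;

        // 3) Adjective component and 4) comparative component.
        long adj = tokens.stream().filter(adjectives::contains).count();
        long comp = tokens.stream().filter(comparatives::contains).count();
        double adjScore = (double) adj / words;
        double compScore = adj == 0 ? 0.0 : (double) comp / adj;

        // Convex combination plus a small bias so that no document is zeroed out.
        return W_SYMBOLS * symbolScore + W_LENGTH * lengthScore
                + W_ADJ * adjScore + W_COMP * compScore + BIAS;
    }
}
```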
3.1.3. Searcher
Apart from the aforementioned re-ranking by document quality (see Section 1), we used the standard Lucene IndexSearcher with the same configuration as the Indexer, i.e. the BM25 similarity measure. Lucene scoring combines the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given document is to a user's query. We used a basic and common similarity measure to compute the relevance score because we focused more on the quality score of the passages. For relevance evaluation, we used the Boolean model to build a single query composed of the following sub-queries, each appropriately boosted and combined with the Boolean clause should (logical or):

1. The first sub-query contains all the terms returned by the same Analyzer used in the Indexer (Section 3.1.1). We assigned the highest boost to this sub-query because it represents the user information need.
2. The second sub-query expands the previous one by adding the synonyms of the terms, taken from a list [20].
3. The last sub-query contains only the N-grams detected by the POS analyzer: we perform a part-of-speech analysis of the topic title with the OpenNLP library [21] to detect sequences of multiple nouns that together form an N-gram, where N stands for the number of nouns in the sequence.

Even though for each topic we return 1 000 ranked documents, we decided to retrieve 10 times that number: when the query score is multiplied by the quality score, documents initially ranked beyond position 1 000 may climb the ranking, and without this expedient they would not be considered at all. A sketch of the query construction and of the re-ranking step is reported below.
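The sketch illustrates how the boosted Boolean query and the quality-based re-ranking could be assembled with the Lucene API. It reuses the QualityAwareAnalyzer sketched in Section 3.1.1; the index path, the field names (body, quality, assumed to be stored as a numeric field) and the way the three sub-query strings are produced are simplifications of ours, while the boost values are those of run 1 in Table 1.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;
import java.util.Arrays;

public class QualitySearcher {

    public static ScoreDoc[] search(String title, String synonyms, String ngrams) throws Exception {
        IndexSearcher searcher = new IndexSearcher(
                DirectoryReader.open(FSDirectory.open(Paths.get("experiment/index"))));
        searcher.setSimilarity(new BM25Similarity());

        // The same analyzer used at indexing time is plugged into the query parser.
        QueryParser parser = new QueryParser("body", new QualityAwareAnalyzer());

        // Three boosted sub-queries combined with SHOULD (logical or); boosts as in run 1 of Table 1.
        Query q = new BooleanQuery.Builder()
                .add(new BoostQuery(parser.parse(QueryParser.escape(title)), 1.3f), BooleanClause.Occur.SHOULD)
                .add(new BoostQuery(parser.parse(QueryParser.escape(synonyms)), 1.2f), BooleanClause.Occur.SHOULD)
                .add(new BoostQuery(parser.parse(QueryParser.escape(ngrams)), 1.25f), BooleanClause.Occur.SHOULD)
                .build();

        // Retrieve 10x the documents we will return, then re-rank by the stored quality.
        TopDocs hits = searcher.search(q, 10_000);
        for (ScoreDoc sd : hits.scoreDocs) {
            Document doc = searcher.doc(sd.doc);
            sd.score *= doc.getField("quality").numericValue().floatValue();
        }
        ScoreDoc[] reRanked = hits.scoreDocs.clone();
        Arrays.sort(reRanked, (a, b) -> Float.compare(b.score, a.score));
        return Arrays.copyOf(reRanked, Math.min(1_000, reRanked.length));
    }
}
```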
4. Experimental Setup
In this section we describe the experimental setup of our system, starting with the data provided by the CLEF lab for Task 2. We then describe how we measured the effectiveness of our IR system by assessing a sample of the retrieved documents, and finally we describe our repository and the hardware we used.

4.1. Data Description
The CLEF organization provided us with:
• The corpus: about 0.9 million passages taken from the ChatNoir dataset [22]. Each passage is organised as a JSON object containing the identifier, the body, and the link to the ChatNoir collection.
• The topics: each of the 50 topics is an XML entry composed of several tags, namely the number, the title, the description, and the narrative.

4.2. Evaluation measures
To overcome the lack of a qrels file, we assessed a sample formed by the first 5 ranked documents of 10 randomly chosen topics. We gave a score of 0 to documents not containing any useful information, 1 to descriptive documents, and 2 to well written comparative passages. From these judgments we computed the nDCG score [23], which is designed to evaluate multi-graded rankings.
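As a reference for how the graded judgments are turned into a single number, the sketch below computes nDCG for one topic following the discounted cumulated gain of [23], where gains at ranks up to the patience b are not discounted and later gains are divided by log_b of the rank. The judgment array and the cut-off are illustrative; this is not the exact evaluation script we used.

```java
import java.util.Arrays;

public class Ndcg {

    /**
     * Discounted cumulated gain as defined by Järvelin and Kekäläinen [23]:
     * gains at ranks <= b are not discounted, later ones are divided by log_b(rank).
     */
    static double dcg(int[] rels, int b) {
        double dcg = 0;
        for (int i = 1; i <= rels.length; i++) {
            dcg += i <= b ? rels[i - 1] : rels[i - 1] / (Math.log(i) / Math.log(b));
        }
        return dcg;
    }

    /** nDCG: the DCG of the run divided by the DCG of the ideal (descending) ordering. */
    static double ndcg(int[] rels, int b) {
        int[] ideal = rels.clone();
        Arrays.sort(ideal);                              // ascending ...
        for (int i = 0; i < ideal.length / 2; i++) {     // ... then reverse to descending
            int tmp = ideal[i];
            ideal[i] = ideal[ideal.length - 1 - i];
            ideal[ideal.length - 1 - i] = tmp;
        }
        double idealDcg = dcg(ideal, b);
        return idealDcg == 0 ? 0 : dcg(rels, b) / idealDcg;
    }

    public static void main(String[] args) {
        // Graded judgments (0 = useless, 1 = descriptive, 2 = comparative) for the top 5 documents.
        int[] judgments = {2, 0, 1, 2, 0};
        System.out.printf("nDCG@5 = %.3f%n", ndcg(judgments, 2)); // patience b = 2
    }
}
```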
4.3. Repository Organization
The Git repository of our project is available at https://bitbucket.org/upd-dei-stud-prj/seupd2122-hextech. It contains the source code, the experiments and the results. The directory code contains all the classes necessary to build and test our project; in the root there is a file named run.sh, which allows an easy run on TIRA [24].

4.4. Hardware used
To test our system we used a personal machine with an AMD Ryzen 9 3950X 16-core CPU @ 3.49 GHz and 16 GiB of RAM, running Windows 10 x64. For the submission we uploaded (through SSH) our retrieval models to a dedicated TIRA virtual machine.

4.5. Script used
To separate comparative from descriptive adjectives we created a script. The script connects to www.dictionary.com through an HTTP call and, given the response from the website, inserts the adjective into the corresponding file.

5. Results and Discussion
In this section we provide a summary of the performance and results of our system, starting with the quality of the documents, followed by the comparison between the rankings of the retrieved documents. Finally, we examine the relevant issues we encountered.

5.1. Quality
To validate our hypothesis, we compared three types of passages with different quality. The first passage has a low quality (0.302) and in fact it does not bring useful information:

~> sup_eq_neg_inf meet_eq_neg_join ~> inf_eq_neg_sup add_eq_meet_join ~> add_eq_inf_sup meet_0_imp_0 ~> inf_0_imp_0 join_0_imp_0 ~> sup_0_imp_0 meet_0_eq_0 ~> inf_0_eq_0 join_0_eq_0 ~> sup_0_eq_0 neg_meet_eq_join ~> neg_inf_eq_sup neg_join_eq_meet ~> neg_sup_eq_inf join_eq_if ~> sup_eq_if mono_meet ~> mono_inf mono_join ~> mono_sup meet_bool_eq ~> inf_bool_eq join_bool_eq ~> sup_bool_eq meet_fun_eq ~> inf_fun_eq join_fun_eq ~> sup_fun_eq meet_set_eq ~> inf_set_eq join_set_eq ~> sup_set_eq meet1_iff ~> inf1_iff meet2_iff ~> inf2_iff meet1I ~> inf1I meet2I ~> inf2I meet1D1 ~> inf1D1 meet2D1 ~> inf2D1 meet1D2 ~> inf1D2 meet2D2 ~> inf2D2 meet1E ~> inf1E meet2E ~> inf2E join1_iff ~> sup1_iff join2_iff ~> sup2_iff join1I1 ~> sup1I1 join2I1 ~> sup2I1 join1I1 ~> sup1I1 join2I2 ~> sup1I2 join1CI ~> sup1CI join2CI ~> sup2CI join1E ~> sup1E join2E ~> sup2E is_meet_Meet ~> is_meet_Inf Meet_bool_def ~> Inf_bool_def Meet_fun_def ~> Inf_fun_def Meet_greatest ~> Inf_greatest Meet_lower ~> Inf_lower Meet_set_def ~> Inf_set_def Sup_def ~> Sup_Inf Sup_bool_eq ~> Sup_bool_def Sup_fun_eq ~> Sup_fun_def Sup_set_eq ~> Sup_set_def listsp_meetI ~> listsp_infI listsp_meet_eq ~> listsp_inf_eq meet_min ~> inf_min join_max ~> sup_max

As we expected, this document has very long terms, plenty of bad symbols and no adjectives. Next we report a higher quality (0.635) document:

Without the breeder, there would be no dogs. Without the dogs, there would be no kennel clubs, no dog shows, no judges, no handlers, no trainers, no dog food companies, no dog publications. Despite their importance, breeders represent a very small segment of the dog world, which in turn, creates the dog business. Furthermore, they are the ones who seldom, if ever, make a profit, even in the most popular breeds; and since they cannot take a livelihood from their breeding activities, they must be able to rely on some other source of income. Why then, do people ever become Breeders?? A breeder has, in her mind, a perfect dog that she someday hopes to create.

Compared with the low quality document, it follows Zipf's principle, it contains a proper number of symbols and an adequate number of adjectives. Indeed, this passage is clearly readable and offers much more information than the former.

5.2. Rank comparison
The 6 runs we tested are presented in Table 1 (we had to drop the last one, since Touché accepts up to 5 runs). The mean nDCG of our assessment is computed by considering only the pool of assessed passages instead of the whole corpus. We also report the Touché mean nDCG computed on the relevance judgments performed by the organizers.

Table 1: Parameters used in the 6 runs. The last two do not use the document quality.

Run | Lovins stemmer | Quality weights (symbols, adj., comparative, Zipf's, bias) | Query boosts (q1, q2, q3) | Mean nDCG (our, Touché)
1   | no             | 1, 0.7, 0.4, 0,   1                                        | 1.3,  1.2,  1.25          | 0.866, 0.589
2   | no             | 1, 0.7, 0.4, 0.6, 1                                        | 1.3,  1.2,  1.25          | 0.720, 0.593
3   | no             | 1, 0.8, 0.6, 0.6, 1                                        | 1.35, 1.25, 1.2           | 0.747, 0.584
4   | yes            | 1, 0.7, 0.4, 0.6, 1                                        | 1.3,  1.2,  1.25          | 0.753, 0.566
5   | no             | -                                                          | 1.3,  1.2,  1.25          | 0.733, 0.597
6   | no             | -                                                          | 1.35, 1.25, 1.2           | 0.747, -

We compared the mean nDCG measures (visible in Figure 2) and drew the following considerations:

• By comparing the first and the fifth run we can observe the importance of the quality: the latter has a lower mean nDCG because it does not take document quality into consideration.
• The Lovins stem filter brings some improvements, as can be seen from the mean nDCG scores of the second and the fourth run.
• Adding the Zipf's weight decreases the effectiveness: the mean nDCG of run 2 is lower than that of run 1. This unexpected result could be caused by a wrong calculation of this component; moreover, the introduction of a new score component makes the others less important.
• Modifying the query boosts can improve the nDCG, as can be seen in the third run with respect to the second.
• In the last case the quality does not have such an impact: the sixth run, which differs from the third only in not using the quality, obtains the same results. It could be that in this example the quality weights are not as influential as the query boosts.

Figure 2: nDCG with patience 2 for several topics (one panel per run, runs 1-6).

The previous analysis could be affected by a subjective classification of document relevance, especially, in our opinion, for the first run, whose nDCG deviates considerably. This is clearly visible by comparing our mean nDCG with the Touché nDCG: our values are always higher, especially for the first run. Moreover, the official results are flatter, indicating that the runs are very similar, as can also be seen in Section 6.

5.3. Relevant issues
During the implementation of the quality score we tested the following metric: the frequency difference of the two most frequent words. The idea was that in comparative documents this difference has to be small, since the two compared objects would be mentioned many times. Unfortunately this was not the case: many documents are affected by noise and the most frequent words are not the subjects of the comparison. Another issue is that the quality does not always represent what we expect; for example, the following badly written document achieves the highest quality (0.81):

J Ul ’M r-, .-i I CO ("") () c Cf) ’"CJ :>-’0) .j. J ’M A .-i 0 0 ..e () A p~ ;:l OM 0 P UO X OJ CO != i ::>. : .-i .-i 0 0 l-l P IN! f;t:l F’l m Lf) CO Lf)\O N \0 NO\o r- .. CO \0 e-i r-.. N CO ..;t Or-.. 00r-.. .P Cf) .j. J .-i 0 0 ..e o Cf) ~ ~ ’U ’"CJ OJ OJ ’M ’M ~ ~ OM ’M PO r-, \0 ~ P 0 ~ .j. J .j. J P Q) P Q)’"CJ ’"CJ P OJ P .j. J Q) .j.

In this case there are only short words, therefore our component referring to Zipf's principle incorrectly boosts the quality of the document. A better computation of this metric could solve the problem.

6. Statistical Analysis
In the following we report the statistical analysis of the runs reported in the previous section. In Figure 3 we report the MAP (Mean Average Precision) of each run in a box-plot: we observe that there is no meaningful difference between the runs.

Figure 3: Box-plots of the average precision of each run (runs in decreasing order of mean performance).

Moreover, in Figure 4 we report the MAP calculated separately for each topic in a box-plot. From it we observe that the majority of the boxes have a small performance variance; therefore the runs have similar performances for a given topic, as can be further seen in Figure 3. One curious observation is that the performance decreases roughly linearly across topics: this is problematic if we want to improve the search for low-performance topics, because there is no marked division between high- and low-performance topics.
One way to further analyze our system is to check whether the outliers of each topic belong to the same run. We expect this not to be the case since, as said before, there is no significant difference among the runs.

Figure 4: Box-plot of the average precision of each topic (topics in decreasing order of mean performance).

Table 2: ANOVA table.

Source  | SS     | df  | MS     | F      | Prob>F
Columns | 0.0514 | 4   | 0.0128 | 0.4688 | 0.7586
Error   | 6.7148 | 245 | 0.0274 |        |
Total   | 6.7662 | 249 |        |        |

Finally, our last statement is further supported by the ANOVA test, visible in Figure 5, which shows that, even if run number four is slightly different from the others, the runs are close in terms of performance. Table 2 reports, among the other results, the F statistic of the ANOVA test: the F value is small (0.4688) and the corresponding p-value (Prob>F = 0.7586) is far above the common significance levels, telling us that there are no relevant differences among the tested runs. To further support this conclusion, we performed a Student's t-test on runs 2 and 5 under the null hypothesis that the two runs are equal. The resulting p-value, equal to 0.556, confirms that there is no significant evidence against the null hypothesis.

Figure 5: ANOVA multiple-comparison test: no groups have means significantly different from run 2.
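The statistical tests above were produced with off-the-shelf tools. As a sketch of how they could be reproduced programmatically, the snippet below computes a one-way ANOVA and a two-sample t-test with the Apache Commons Math library (which our system does not actually depend on); the per-topic average precision arrays are placeholders, not our real run data.

```java
import org.apache.commons.math3.stat.inference.OneWayAnova;
import org.apache.commons.math3.stat.inference.TTest;

import java.util.Arrays;
import java.util.List;

public class RunComparison {

    public static void main(String[] args) {
        // Placeholder data: one array of per-topic average precision values for each run.
        double[] run1 = {0.42, 0.31, 0.55, 0.27, 0.48};
        double[] run2 = {0.40, 0.33, 0.52, 0.25, 0.47};
        double[] run3 = {0.41, 0.30, 0.53, 0.26, 0.49};
        double[] run4 = {0.38, 0.29, 0.50, 0.24, 0.45};
        double[] run5 = {0.39, 0.32, 0.51, 0.26, 0.46};

        // One-way ANOVA over the five runs (cf. Table 2): a large p-value means that
        // we cannot reject the hypothesis that the run means are equal.
        List<double[]> runs = Arrays.asList(run1, run2, run3, run4, run5);
        double anovaP = new OneWayAnova().anovaPValue(runs);
        System.out.printf("ANOVA p-value: %.4f%n", anovaP);

        // Two-sample t-test between run 2 and run 5, as in Section 6.
        double tP = new TTest().tTest(run2, run5);
        System.out.printf("t-test p-value (run 2 vs run 5): %.4f%n", tP);
    }
}
```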
7. Conclusions and Future Work
In conclusion, looking at the measures, we can say that the introduction of the document quality to re-rank the retrieved documents brings an improvement in terms of effectiveness. With respect to our previous considerations, the quality metric could be readjusted to be used also for other tasks. Moreover, our IR system achieves similar performances even without the quality metric: probably our Indexer and Searcher work well on their own. The normalized DCG is 72% for the worst run (2) and 86% for the best run (1). Thus, according to our relevance assessments, we can conclude that our IR system seems to perform well, but it can surely be improved. We would have liked to try the following strategies:

Grammar analysis As said in Section 3.1.1, we left room in the Analyzer to add grammatical analysis filters. Thanks to sentence analysis it would be possible to improve the search.

Parameters tuning When the qrels become available, it will be easier to tune the weights for the quality and the boosts for the query.

Repeated sentences detection During the document quality computation, it is possible to detect repeated sentences in order to penalize badly structured or spam documents.

References
[1] A. Bondarenko, M. Hagen, M. Fröbe, M. Beloucif, C. Biemann, A. Panchenko, Touché task 2: Argument retrieval for comparative questions, 2022. URL: https://webis.de/events/touche-22/shared-task-2.html.
[2] A. Bondarenko, M. Fröbe, J. Kiesel, S. Syed, T. Gurcke, M. Beloucif, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2022: Argument Retrieval, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association (CLEF 2022), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2022, p. to appear.
[3] J. Callan, The Lemur project and its ClueWeb12 dataset, in: Invited talk at the SIGIR 2012 Workshop on Open-Source Information Retrieval, 2012. URL: https://www.lemurproject.org/clueweb12/.
[4] J. B. Lovins, Development of a stemming algorithm, Mech. Transl. Comput. Linguistics 11 (1968) 22–31.
[5] D. K. Harman, Overview of the Third Text REtrieval Conference (TREC-3), 500, DIANE Publishing, 1995.
[6] K. S. Jones, S. Walker, S. Robertson, A probabilistic model of information retrieval: development and comparative experiments, part 2, Information Processing and Management 36 (2000) 840.
[7] A. Alhamzeh, M. Bouhaouel, E. Egyed-Zsigmond, J. Mitrović, DistilBERT-based argumentation retrieval for answering comparative questions, Working Notes of CLEF (2021).
[8] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument Retrieval, in: K. Candan, B. Ionescu, L. Goeuriot, H. Müller, A. Joly, M. Maistro, F. Piroi, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. 12th International Conference of the CLEF Association (CLEF 2021), volume 12880 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2021, pp. 450–467. URL: https://link.springer.com/chapter/10.1007/978-3-030-85251-1_28. doi:10.1007/978-3-030-85251-1_28.
[9] A. Bondarenko, M. Fröbe, J. Kiesel, S. Syed, T. Gurcke, M. Beloucif, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2022: Argument Retrieval, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2022.
[10] A. Radford, An Introduction to English Sentence Structure, Cambridge University Press, 2009.
[11] G. K. Zipf, Human Behaviour and the Principle of Least Effort, Addison-Wesley Press, Cambridge, 1949.
[12] J. K. Kanwal, Word length and the principle of least effort: language as an evolving, efficient code for information transfer, 2018.
[13] The Apache Software Foundation, Apache Lucene, 2022. URL: https://lucene.apache.org/core/9_1_0/index.html.
[14] Oracle, Package org.w3c.dom, 2022. URL: https://docs.oracle.com/en/java/javase/17/docs/api/java.xml/org/w3c/dom/package-summary.html.
[15] Oracle, Package java.util.zip, 2022. URL: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/zip/package-use.html.
[16] Google, Module com.google.gson, 2022. URL: https://javadoc.io/doc/com.google.code.gson/gson/latest/com.google.gson/module-summary.html.
[17] I. Brigadir, Default English stop words from different sources, 2019. URL: https://github.com/igorbrigadir/stopwords.
[18] A. Verma, M. Winkelmann, English words names brands places, 2016. URL: https://github.com/MatthiasWinkelmann/english-words-names-brands-places.
[19] K. Atkinson, Spell Checking Oriented Word Lists (SCOWL), 2020. URL: http://wordlist.aspell.net/.
[20] H. Robotics, hr-solr, 2019. URL: https://github.com/hansonrobotics/hr-solr.
[21] The Apache Software Foundation, Apache OpenNLP, 2022. URL: https://opennlp.apache.org.
[22] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl, in: L. Azzopardi, A. Hanbury, G. Pasi, B. Piwowarski (Eds.), Advances in Information Retrieval. 40th European Conference on IR Research (ECIR 2018), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2018.
[23] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems (TOIS) 20 (2002) 422–446.
[24] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.