SEUPD@CLEF: Team Kalu on improving Search Engine Performance with Query Expansion and Re-Ranking Approach

Kimia Abedini1, Akan Akysh1, Arwa Fahoud1 and Nicola Ferro1
1 University of Padua, Italy

Abstract
This report describes the search engine designed by Team KALU for Task 1 of the LongEval Lab at the Conference and Labs of the Evaluation Forum (CLEF) 2024. The team, composed of students from the University of Padua, developed this system to efficiently index, search, and retrieve documents. We begin by outlining the problem, then describe our system, which works mainly on a collection of documents written in French, and explain the methodologies we implemented. Finally, we present our experimental results and analyze them according to the techniques we employed.

Keywords
Information Retrieval, Search Engines, Retrieve Documents, Query Expansion, LongEval, CLEF

1. Introduction
Search engines have transformed people's access to information. They have become fundamental tools in daily life, providing access to a vast amount of data, from academic research and global news to everyday queries such as shopping and local weather. However, the exponential growth of data available online presents a significant challenge for search engines in terms of storage, indexing, and retrieval. This paper addresses that challenge with an information retrieval system designed to adapt to evolving data while sustaining high performance. We implement a search engine for Task 1 of the "LongEval" lab proposed by CLEF 2024, which requires searching a corpus of documents and retrieving those most relevant to a predefined set of queries gathered from Qwant [1].
The paper is structured as follows: Section 2 presents the related work we started from; Section 3 outlines our approach; Section 4 explains our experimental setup; Section 5 discusses our main findings; Section 6 reports a statistical analysis of the submitted runs; and Section 7 concludes with reflections and future directions.

2. Related Work
We used the paper by the LongEval organizers [2] to understand the task and the datasets provided by CLEF. It explains how the documents and queries were collected and what the main objectives of the task are. The paper also provides baseline performances, which we used to benchmark our system during development. Based on the work of the CLOSE [3] and FADERIC [4] teams, we chose to use query expansion techniques. We explored several variants, including different Large Language Models (LLMs) and different prompts. We also built our re-ranking approach on the work of Jina AI [5], which developed the jina-reranker model.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 9–12, 2024, Grenoble, France
Email: kimia.abedini@studenti.unipd.it (K. Abedini); akan.akysh@studenti.unipd.it (A. Akysh); arwa.fahoud@studenti.unipd.it (A. Fahoud); nicola.ferro@unipd.it (N. Ferro)
URL: https://www.dei.unipd.it/~ferro/ (N. Ferro)
ORCID: 0000-0001-9219-6239 (N. Ferro)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073

3. Methodology
In this section, we describe the components that make up the search engine and the configurations used with each of them.

Figure 1: Workflow of the IR system implemented by KALU.

3.1. Parser
The parser processes the collection of documents, extracting valuable information and filtering out irrelevant data. We also use the parser to extract the ID and body of each document.
Our parser consists of three classes: ParsedDocument, DocumentParser, and LongEvalDocumentParser.

Figure 2: Document structure without parsing.

• ParsedDocument represents a parsed document to be indexed. It has two attributes: ID, the unique identifier of the document, and body, the document's content. The class provides methods to set and retrieve these attributes.
• DocumentParser is an abstract class providing the basic functionality to iterate over the documents in a file, reading and parsing their content into ParsedDocument objects.
• LongEvalDocumentParser is the DocumentParser specific to the LongEval corpus. It implements a parser for documents in the TREC format: it reads each document, replaces all tags with spaces, and returns a ParsedDocument containing the ID and the body of the document.

3.2. Analyzer
The analyzer applies further processing steps to both documents and queries. It performs tokenization, stemming, and stop-word filtering.

• Tokenization is the process of splitting text into tokens (each token can be thought of as a single word). We used StandardTokenizer [6], one of the tokenizers provided by Apache Lucene.
• Stemming is the process of removing suffixes or prefixes from words to obtain their root form. In this system we experimented with FrenchLightStemmer [7].
• Stop-word removal is the process of eliminating words that appear frequently in most documents and carry little meaningful information for search queries. Our stop-word list combines words extracted with Luke [8], a tool provided by Lucene [9], and the French stop-word list available on Kaggle [10]. Removing stop words improved the search engine's performance by reducing the index size and, consequently, the time needed to search the index.

Figure 3: Analyzer Process.
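The analysis steps can be illustrated with a small, self-contained sketch. It is not the Lucene-based analyzer used in the system: the regular expressions, the tiny `STOP_WORDS` set and the `analyze` function are hypothetical stand-ins chosen for illustration, the order of the steps is an assumption, and stemming is deliberately left out because FrenchLightStemmer is not reimplemented here. The elision handling and the 2 to 15 character length filter mirror settings described elsewhere in this paper.

```python
import re

# Hypothetical, tiny stop list for illustration only; the real list is far longer.
STOP_WORDS = {"le", "la", "les", "de", "des", "un", "une", "et", "pour"}

# Elided forms l', d', s', t', n', m' at the start of a token (straight or curly apostrophe).
ELISION = re.compile(r"^(?:l|d|s|t|n|m)['\u2019]")

# Very rough tokenizer: runs of word characters and apostrophes.
TOKEN = re.compile(r"[\w'\u2019]+")


def analyze(text, min_len=2, max_len=15):
    """Lowercase, tokenize, strip elisions, drop stop words and out-of-range tokens."""
    terms = []
    for raw in TOKEN.findall(text.lower()):
        term = ELISION.sub("", raw)          # "l'avion" -> "avion"
        if term in STOP_WORDS:               # stop-word filter
            continue
        if not (min_len <= len(term) <= max_len):  # length filter
            continue
        terms.append(term)
    return terms


if __name__ == "__main__":
    print(analyze("L'avion de la compagnie"))
```

The same function would be applied to documents at indexing time and to queries at search time, so that both sides are reduced to comparable terms.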
We also used FrenchElisionFilter [11], which handles elision in French: contracted forms such as l', d', s', t', n', and m' are stripped from the word that follows them by removing both the apostrophe and the preceding character.

3.3. Indexer
Indexing is one of the main processes of the system: it generates a searchable data structure, the index, for the parsed documents. The index holds crucial information about the documents, such as the terms they contain, their frequencies, and their positions within each document. It enables fast retrieval of documents that match the keywords or phrases of a query. Our indexer is configured with the following parameters:

• analyzer: the analyzer to be used, which is LongEvalAnalyzer.
• similarity: the object used to score the relevance of a document based on the query terms it contains. It can be either Lucene's classic implementation, based on a variant of the Term Frequency-Inverse Document Frequency (TF-IDF) model, or the more modern BM25, which we use.
• ramBufferSizeMB: the size in megabytes of the RAM buffer used while indexing the documents.
• indexPath: the path to the directory where the generated index is stored.
• docsPath: the path to the documents directory.
• dpCls: the DocumentParser class responsible for parsing the documents in the collection.

We used BM25 because, unlike TF-IDF, where term frequency affects the score linearly, BM25 introduces a saturation point that prevents the term frequency component from growing indefinitely. BM25 scores a document based on the query terms appearing in it using the following formula:

$$\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

where $D$ is the document being scored, $Q$ is the query consisting of the words $q_1, q_2, \ldots, q_n$, $f(q_i, D)$ is the frequency of the term $q_i$ in document $D$, $\mathrm{IDF}(q_i)$ is the inverse document frequency of the term $q_i$, $|D|$ is the length of the document, and $\mathrm{avgdl}$ is the average document length in the collection. Finally, $k_1$ and $b$ are free parameters; we kept the commonly used values $k_1 = 1.2$ and $b = 0.75$.

Once configured, the indexer does the following:

• It walks through the documents directory to find documents (specifically .txt files) and processes each file for indexing.
• It uses the DocumentParser class to parse the documents into structured ParsedDocument objects containing the document identifiers and bodies.
• It converts each parsed document into a Lucene Document object and adds it to the index. The document's ID and body are stored as fields within the Lucene Document.

The fields added to the Lucene Document are configured as follows:

• IDField: the ID field stores only the document ID, without term frequencies or positions, which are unnecessary for unique identifiers. The original value of the field is stored directly in the index so that it can be retrieved when querying.
• BodyField: the body field stores the terms resulting from tokenizing the body of the document, along with their frequencies, but not the original text, in order to reduce the size of the index.

3.4. Searcher
The searcher scans the indexed documents, analyzes user queries, and retrieves relevant information. It then presents a ranked list of documents that satisfy the user's information need. Our implementation accepts the following parameters:

• analyzer: in this case, an instance of our Analyzer.
• similarity: we decided to use the BM25Similarity [12] function for the first stage of document retrieval because of its efficiency and higher effectiveness compared to other methods.
• Run options: the index path, the topics path, the run path and the run name, the number of expected topics, and the maximum number of documents retrieved (1000 in our case).
• Search options: whether to apply query boosting, the query boosting value, the number of documents to be re-ranked, the score calculation mode, the query expansion mode, and the LLM used to generate the expansions.

3.4.1. Query Expansion
Query expansion can improve the performance of our search engine, depending on how it is used. We generate multiple expansions for each query with a Python script. The script reads the *.trec topic file and generates synonymous phrases using the Meta Llama 3 [13] and Mistral-7B [14] models, both of which are open-source. The prompt used for generation is as follows:

Instruction: Please provide {num_expansion} synonyms in French for the given keyword that convey similar meanings, The output should be a list of words separated by commas without any further punctuation. The keyword is {word}.

After this step, we lightly cleaned the generated file. Because some generations were of low quality or missing, we implemented a mechanism in the searcher that automatically switches to a second model if the first one fails. Table 1 shows sample results for the prompt.

Table 1
Sample results for query expansion using the given prompt.

| Query | Meta Llama 3 | Mistral-7B |
| anti-virus gratuit | 1. programmes antivirus libres 2. solutions antivirus gratuites. | 1. logiciel antivirus gratuit 2. solution antivirus gratuit |
| bardeau | 1. clôture 2. écran 3. barrage | 1. plaque de bois 2. tableau 3. pannée |

3.4.2.
Query Boosting
Query boosting is a technique for adjusting the scores of the documents retrieved by a query, so that relevance can be customized according to specific criteria. When executing a query, Lucene assigns a score to each matching document based on its relevance; boosting modifies the contribution of particular parts of the query to these scores. Through experimentation, we found that mixing query expansion and boolean queries sometimes produced poorer outcomes. However, by introducing Lucene's BoostQuery [15], we observed improvements in our evaluation metrics. We experimented with three boost values:

• The number of expansions.
• A fine-tuned parameter of 14.68, multiplied by the number of expansions.
• $14.68 \times \left(\text{num\_expansion} + \frac{\text{total\_expansion}}{\text{num\_expansion}} - 1\right)$

where
– num_expansion is the number of queries we selected in query expansion.
– total_expansion is the total number of queries we expanded.

Our reasoning for the third approach was to prioritize exact terms when a word has many meanings. However, our findings suggest that this strategy does not produce the anticipated results. We therefore settled on the second approach, using a boolean query in which we added at most three expansions as SHOULD clauses and boosted the main query as a MUST clause.

3.4.3. Document Re-Ranking
Document re-ranking is the process of taking the initially ranked list of documents and re-evaluating their relevance based on new information, constraints, or preferences. Our approaches to document re-ranking are:

• Secondary ranking function: apply a secondary ranking function that considers additional criteria or constraints.
• Score adjustment: modify the scores of individual documents based on another score.

For the ranking function, we considered two different approaches: using Bidirectional Encoder Representations from Transformers (BERT) models and using LLMs.
In both cases, the objective is to compute embeddings of the text and determine the cosine similarity between the query and the document. We looked for a fast SBERT [16] model and a well-tuned LLM to compare their performance. We opted for jina-reranker-v1-turbo-en, which is designed for fast re-ranking while maintaining competitive performance and is built on the JinaBERT [17] model. For the LLM, we chose sentence-croissant-llm-base, which is designed to produce French text embeddings and was fine-tuned from the recently pre-trained LLM croissantllm/CroissantLLMBase [18]. In the end, we found that employing LLMs for re-ranking is computationally expensive and produces nearly identical results. Consequently, we decided to use jina-reranker-v1-turbo-en, re-ranking the first 200 documents and leaving the rest unchanged.

For score adjustment we used two modes:

• Simple Mode: replace the score of the document directly with the score of the secondary ranking function.
• Harmonic Mode: combine the BM25 score $x_1$ with the secondary ranking function score $x_2$ through their harmonic mean:

$$H = \frac{2}{\frac{1}{x_1} + \frac{1}{x_2}}$$

Based on our results, the Harmonic mode performs better.

4. Experimental Setup
Our Information Retrieval (IR) system was evaluated on the LongEval collection, the official training collection of the 2024 LongEval IR Lab (https://clef-longeval.github.io/). The collection contains French-language web pages and queries, along with their English translations. We used the French data for our experiments. To assess the performance of our IR system, we used the trec_eval executable to evaluate the results under various configurations. We monitored improvements in the following evaluation metrics produced by trec_eval:

• num_ret: Number of documents retrieved for each query.
• num_rel: Number of relevant documents for each query.
• num_rel_ret: Number of relevant documents retrieved for each query.
• map: Mean Average Precision, the mean over all queries of the average precision of the ranked list.
• rprec: R-Precision, the precision at rank R, where R is the number of relevant documents for the query.
• p@5 & p@10: Precision at 5 and at 10, the precision computed over the top 5 and top 10 retrieved documents for each query.
• nDCG: Normalized Discounted Cumulative Gain, a metric that evaluates ranked lists by taking into account the relevance and rank position of each item.

Our project's Git repository is publicly available at https://bitbucket.org/upd-dei-stud-prj/seupd2324-kalu/src/master, so the code is openly accessible for replication. We computed our runs on a MacBook Pro with an M2 Max chip, 12-core CPU, 30-core GPU, and 32GB RAM.

5. Results and Discussion
In this section, we present the most relevant results obtained during the development phase. We consider five primary milestones that, after multiple trials, substantially improved our Mean Average Precision (MAP) score and the overall number of relevant documents retrieved. Several models were evaluated, focusing on re-ranking and query expansion techniques.

Table 2
Parameters used in the 5 different runs submitted to CLEF

| Parameter | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
| Token Filter | Porter | FrenchLight | FrenchLight | FrenchLight | FrenchLight |
| Tokenizer | Standard | Standard | Standard | Standard | Standard |
| Length Filter | 2-15 | 2-15 | 2-15 | 2-15 | 2-15 |
| Stop Filter | "None" | "stoplist-fr" | "stoplist-fr" | "stoplist-fr" | "stoplist-fr" |
| Lower Case Filter | Yes | Yes | Yes | Yes | Yes |
| Similarity | BM25 | BM25 | BM25 | BM25 | BM25 |
| Query Expansion | No | Yes | Yes | Yes | Yes |
| Query Expansion Model | - | Llama 3 | Mistral* | Mistral* | Mistral* |
| Boolean Clause Main Query Mode | "SHOULD" | "SHOULD" | "SHOULD" | "SHOULD" | "MUST" |
| Re-ranking | No | Yes | Yes | Yes | Yes |
| Score Combination Mode | - | Simple | Simple | Harmonic | Harmonic |
| Num. of Re-ranked Documents | - | 100 | 100 | 200 | 200 |

* If the model failed, it would switch to another one.

5.1.
Results on training data

Table 3
Results for systems (top-1000 documents), on the French Train collection and the train query set of LongEval.

| metrics | run1 | run2 | run3 | run4 | run5 |
| num_q | 597 | 599 | 599 | 599 | 597 |
| num_ret | 584902 | 598281 | 599000 | 599000 | 587943 |
| num_rel | 4344 | 4362 | 4362 | 4362 | 4350 |
| num_rel_ret | 3542 | 3553 | 3583 | 3583 | 3578 |
| map | 0.1853 | 0.2286 | 0.2313 | 0.2366 | 0.2374 |
| gm_map | 0.0484 | 0.0659 | 0.0744 | 0.0817 | 0.0841 |
| Rprec | 0.1756 | 0.2271 | 0.2281 | 0.23 | 0.2308 |
| recall_10 | 0.231 | 0.2855 | 0.2857 | 0.2915 | 0.2925 |
| recall_100 | 0.5666 | 0.574 | 0.5826 | 0.6174 | 0.6192 |
| recall_1000 | 0.8123 | 0.812 | 0.821 | 0.821 | 0.8225 |
| ndcg | 0.3692 | 0.4065 | 0.4097 | 0.4161 | 0.4173 |
| ndcg_rel | 0.2875 | 0.3274 | 0.329 | 0.3344 | 0.3354 |
| Rndcg | 0.2243 | 0.269 | 0.2701 | 0.2745 | 0.2754 |
| ndcg_cut_10 | 0.1954 | 0.2464 | 0.2458 | 0.2511 | 0.252 |
| ndcg_cut_100 | 0.3162 | 0.3556 | 0.3591 | 0.3727 | 0.374 |
| ndcg_cut_1000 | 0.3692 | 0.4065 | 0.4097 | 0.4161 | 0.4173 |
| map_cut_10 | 0.1271 | 0.1691 | 0.1702 | 0.1729 | 0.1735 |
| map_cut_100 | 0.1807 | 0.2244 | 0.2273 | 0.2331 | 0.2339 |
| map_cut_1000 | 0.1853 | 0.2286 | 0.2313 | 0.2366 | 0.2374 |

Initially, we found that using FrenchLightStemFilter [19] as the stemmer and setting the length filter to keep tokens of 2 to 15 characters (reflecting the tendency of French to have longer words) yielded very positive results [3]. We then compared four models: the base model; re-ranking 100 documents with the simple score combination mode; the same re-ranking configuration with Mistral query expansion and a threshold of three words; and the same re-ranking configuration with Llama query expansion and three words. The third model, using Mistral query expansion, achieved the highest MAP of 0.044, surpassing the base model. We subsequently introduced two more models to study the difference between score combination modes and the handling of empty expansions. By handling empty expansions where necessary, using stop words, and using the harmonic score combination, we achieved a higher MAP of 0.0487 compared to the base model.
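The two score combination modes compared above can be sketched in a few lines. This is a minimal illustration, not the code of our system: the function names `combine` and `rerank`, the data layout, and the treatment of non-positive scores are assumptions made for the example, and the text does not specify how the re-ranked head is merged with the untouched tail, so the sketch simply keeps the tail below the head in its original order.

```python
def combine(bm25_score, rerank_score, mode="harmonic"):
    """Combine the first-stage score with the secondary ranking score."""
    if mode == "simple":
        # Simple mode: the secondary score replaces the BM25 score.
        return rerank_score
    if mode == "harmonic":
        # Assumption: a non-positive score makes the harmonic mean undefined; return 0.
        if bm25_score <= 0 or rerank_score <= 0:
            return 0.0
        return 2.0 / (1.0 / bm25_score + 1.0 / rerank_score)
    raise ValueError(f"unknown mode: {mode}")


def rerank(ranked, rerank_scores, top_k=200, mode="harmonic"):
    """Re-rank the first top_k documents and leave the rest unchanged.

    ranked: list of (doc_id, bm25_score) sorted by decreasing BM25 score.
    rerank_scores: dict mapping doc_id to the secondary ranking score.
    """
    head, tail = ranked[:top_k], ranked[top_k:]
    rescored = [(doc_id, combine(score, rerank_scores[doc_id], mode))
                for doc_id, score in head]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored + tail


if __name__ == "__main__":
    first_stage = [("d1", 10.0), ("d2", 8.0), ("d3", 6.0)]
    secondary = {"d1": 0.2, "d2": 0.9}
    print(rerank(first_stage, secondary, top_k=2))
```

The harmonic mean is dominated by the smaller of the two scores, so a document needs to do reasonably well under both rankers to stay near the top, whereas the simple mode discards the BM25 evidence entirely.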
To sum up, the most successful approaches combined re-ranking with Mistral query expansion (falling back to Llama 3 when needed), a threshold of three expansions, and stop words, with the harmonic mean performing best among the score combination methods. We tested additional models, including summarizing the texts with an LLM, flan-t5-3b-summarizer [20], and either adding the summaries to the original texts before indexing or completely replacing the original texts with them. However, these approaches provided results similar to the simpler methods and required significantly more time to execute. Various boosting methods were also explored, but most of them decreased MAP on the training dataset (see Section 3.4.2). We also considered discarding the use of LLMs for re-ranking because of their poor performance.

Figure 4: Standard Recall Levels vs Interpolated Precision (one precision curve per run, run1 to run5).

5.2. Results on Test data
In this section we report the results obtained by running our systems on the two available test collections, short term and long term.

Table 4
Scores of all five systems on short term test collection.

| Evaluation measure | run1 | run2 | run3 | run4 | run5 |
| Map | 0.1578 | 0.1855 | 0.1875 | 0.1922 | 0.1922 |
| nDCG | 0.2984 | 0.3225 | 0.3240 | 0.3302 | 0.3302 |
| nDCG@10 | 0.1886 | 0.2247 | 0.2264 | 0.2297 | 0.2297 |
| P@10 | 0.1472 | 0.1745 | 0.1752 | 0.1789 | 0.1789 |
| Recall | 0.5884 | 0.5867 | 0.5884 | 0.5884 | 0.5884 |

Table 5
Scores of all five systems on long term test collection.

| Evaluation measure | run1 | run2 | run3 | run4 | run5 |
| Map | 0.1067 | 0.1400 | 0.1395 | 0.1430 | 0.1434 |
| nDCG | 0.2193 | 0.2502 | 0.2494 | 0.2535 | 0.2542 |
| nDCG@10 | 0.1479 | 0.1921 | 0.1912 | 0.1931 | 0.1936 |
| P@10 | 0.1145 | 0.1407 | 0.1397 | 0.1413 | 0.1417 |
| Recall | 0.4142 | 0.4136 | 0.4131 | 0.4131 | 0.4142 |

6. Statistical Analysis
In this section, we conduct a statistical analysis of the retrieval effectiveness of the five runs we submitted to CLEF.
This evaluation aims to assess each run's performance and determine how well the system retrieves and ranks relevant documents. We compared the Normalized Discounted Cumulative Gain (nDCG) and Mean Average Precision (MAP) of each run to understand the performance differences among them, considering both the short-term and long-term evaluations. The analysis uses box plots, two-way Analysis of Variance (ANOVA), and Tukey's test. We first used box plots to inspect the distribution of the scores. We then applied two-way ANOVA tests to explore the differences observed in both the short-term and long-term evaluations. Finally, we used the Tukey Honest Significant Difference (HSD) test, a post-hoc analysis for ANOVA that compares group means while controlling for multiple comparisons, to reliably identify significant differences.

Table 6
Recap of our runs submitted to CLEF.

| runs | language | type |
| run1 | French | Base |
| run2 | French | Query Expansion using LLama 3 |
| run3 | French | Re-ranking simple mode |
| run4 | French | Harmonic Re-ranking using Should |
| run5 | French | Harmonic Re-ranking using Must |

6.1. Box Plot
Box plots are graphical tools that represent a distribution of data concisely. In our case, we plot the distribution of the per-query scores achieved by our submitted systems on the different test sets, with respect to nDCG and MAP.

Figure 5: box plot for both the short-term and long-term set runs (nDCG performance). Panels: (a) long-term set runs, (b) short-term set runs.

Looking at the nDCG performance of all runs on the short-term set, we observe that run1 achieves lower nDCG scores, indicating that it is less effective at capturing and ranking relevant documents, while the other four runs perform at approximately the same level. The same pattern holds for the long-term set runs.
Figure 6: box plot for both the short-term and long-term set runs (Map performance). Panels: (a) long-term set runs, (b) short-term set runs.

The box plots show the distribution of MAP scores for each run. On both the short-term and long-term sets, run1 has the lowest MAP scores, indicating lower accuracy; run2 and run3 have approximately the same MAP scores; and run4 and run5 have the highest MAP scores, slightly above run2 and run3.

6.2. Two-way ANOVA
With a two-way ANOVA test, we check whether the factors Topic and System influence the results, and we test on both the MAP and nDCG measures.

Table 7
Two-way ANOVA results for short-term nDCG

| Source | df | SS | MS | F | PR(>F) |
| Columns (Systems) | 4.0 | 0.2785 | 0.0696 | 21.3414 | 3.594524e-17 |
| Rows (Topics) | 403.0 | 82.8188 | 0.2055 | 62.9957 | 0 |
| Error | 1612.0 | 5.2587 | 0.0033 | - | - |
| Total | 2019.0 | 88.3559 | - | - | - |

Table 8
Two-way ANOVA results for short-term Map

| Source | df | SS | MS | F | PR(>F) |
| Columns (Systems) | 4.0 | 0.3358 | 0.0840 | 24.6253 | 8.216775e-20 |
| Rows (Topics) | 403.0 | 67.4513 | 0.1674 | 49.0915 | 0 |
| Error | 1612.0 | 5.4960 | 0.0034 | - | - |
| Total | 2019.0 | 73.2831 | - | - | - |

Table 9
Two-way ANOVA results for long-term nDCG

| Source | df | SS | MS | F | PR(>F) |
| Columns (Systems) | 4.0 | 1.2535 | 0.3134 | 11.8346 | 1.4110e-09 |
| Rows (Topics) | 1511.0 | 103.3381 | 0.0684 | 2.5829 | 1.8578e-142 |
| Error | 6044.0 | 160.0363 | 0.0265 | - | - |
| Total | 7559.0 | 264.6278 | - | - | - |

Table 10
Two-way ANOVA results for long-term Map

| Source | df | SS | MS | F | PR(>F) |
| Columns (Systems) | 4.0 | 1.4099 | 0.3525 | 19.2277 | 9.8898e-16 |
| Rows (Topics) | 1511.0 | 71.7684 | 0.0475 | 2.5909 | 1.9168e-143 |
| Error | 6044.0 | 110.7997 | 0.0183 | - | - |
| Total | 7559.0 | 183.9780 | - | - | - |

From the results of the two-way ANOVA tests we conclude that both factors (System and Topic) significantly influence the performance measures (nDCG and MAP) in both the short-term and long-term evaluations.
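To make the structure of these tables concrete, the following sketch shows how the degrees of freedom, sums of squares and F statistics of a two-way ANOVA with one observation per (topic, system) cell can be computed. It is an illustration in plain Python, not the tool used to produce the tables; the function names `two_way_anova` and `f_from_table` and the toy score matrix are hypothetical, and p-values are omitted because they require the F distribution from a statistics library.

```python
def two_way_anova(scores):
    """scores[t][s] is the measure (e.g. nDCG) of system s on topic t."""
    n_topics, n_systems = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_topics * n_systems)
    topic_means = [sum(row) / n_systems for row in scores]
    system_means = [sum(row[s] for row in scores) / n_topics
                    for s in range(n_systems)]

    # Sums of squares for the two factors, the total, and the residual error.
    ss_systems = n_topics * sum((m - grand) ** 2 for m in system_means)
    ss_topics = n_systems * sum((m - grand) ** 2 for m in topic_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_systems - ss_topics

    df_systems, df_topics = n_systems - 1, n_topics - 1
    df_error = df_systems * df_topics
    ms_error = ss_error / df_error
    return {
        "df": (df_systems, df_topics, df_error),
        "F_systems": (ss_systems / df_systems) / ms_error,
        "F_topics": (ss_topics / df_topics) / ms_error,
    }


def f_from_table(ss_effect, df_effect, ss_error, df_error):
    """Recompute an F statistic from the SS and df columns of an ANOVA table."""
    return (ss_effect / df_effect) / (ss_error / df_error)


if __name__ == "__main__":
    toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]]  # 3 topics x 2 systems, made-up values
    print(two_way_anova(toy))
    # Systems row of Table 7, using the rounded values printed there.
    print(f_from_table(0.2785, 4, 5.2587, 1612))
```

With 5 systems and 404 topics the same formulas give 4, 403 and 4 × 403 = 1612 degrees of freedom, matching the short-term tables, and recomputing F for the Systems row of Table 7 from its rounded SS values gives approximately 21.34.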
Figure 7: Tukey's HSD test for all the five runs in long-term evaluation

Figure 8: Tukey's HSD test for all the five runs in long-term evaluation

7. Conclusions and Future Work
In this work, we presented our approach to the CLEF LongEval Lab 2024 task, which aimed to develop an effective and efficient search engine for web documents. Our approach combined different techniques, including query expansion, re-ranking, and the use of large language models for different purposes. Our experiments showed good results, with better effectiveness and efficiency than the baseline system provided by CLEF. Combining two scores in the re-ranking phase also improved retrieval performance.

We identified several areas in which our approach can be improved. One promising direction is to use text summarization and title extraction techniques in the parsing stage. We experimented with this approach, but it did not yield significant improvements and raised efficiency concerns. However, we believe that refining this technique or exploring alternatives could lead to better results. Another idea is to embed documents using different methods [21] for re-ranking, or to embed text chunks and their summaries, since splitting documents into small pieces can increase the accuracy and quality of the system and help capture nuanced semantic relationships between documents. Additionally, providing context when calling LLMs to generate synonyms might have a positive impact on overall retrieval performance. Furthermore, we plan to fine-tune our re-ranker with SBERT on the training data and to implement a custom re-ranker specific to this task. By leveraging the strengths of different models and techniques, we hope to achieve even better results and push the boundaries of what is possible in LongEval information retrieval.

References
[1] Qwant, About qwant, https://about.qwant.com/en/, 2023. Accessed: 2023-05-20.
[2] P. G. R.
Deveaud, G. Gonzalez-Saez, P. Mulhem, L. Goeuriot, F. Piroi, M. Popel, Longeval-retrieval: French-english dynamic test collection for continuous web search evaluation, 2023. arXiv:2303.03229.
[3] G. Antolini, N. Boscolo, M. Cazzaro, M. Martinelli, S. Safavi, F. Shami, N. Ferro, et al., Seupd@clef: Team close on temporal persistence of ir systems' performance, in: CEUR Workshop Proceedings, volume 3497, CEUR-WS, 2023, pp. 2368–2395.
[4] E. Bolzonello, C. Marchiori, D. Moschetta, R. Trevisiol, F. Zanini, N. Ferro, et al., Seupd@clef: Team faderic on a query expansion and reranking approach for the longeval task, in: CEUR Workshop Proceedings, volume 3497, CEUR-WS, 2023, pp. 2252–2280.
[5] M. Günther, J. Ong, I. Mohr, A. Abdessalem, T. Abel, M. K. Akram, S. Guzman, G. Mastrapas, S. Sturua, B. Wang, M. Werk, N. Wang, H. Xiao, Jina embeddings 2: 8192-token general-purpose text embeddings for long documents, 2024. arXiv:2310.19923.
[6] A. Lucene, Standardtokenizer, https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html, 2024. Accessed: 2024-05-20.
[7] A. Lucene, Frenchlightstemmer, https://lucene.apache.org/core/6_2_0/analyzers-common/org/apache/lucene/analysis/fr/FrenchLightStemmer.html, 2024. Accessed: 2024-04-20.
[8] A. Lucene, Luke, https://lucene.apache.org/core/8_11_0/luke/index.html, 2024. Accessed: 2024-05-20.
[9] A. Lucene, Apache lucene, https://lucene.apache.org/, 2023. Accessed: 2023-05-20.
[10] Kaggle, Frenchkagglestoplist, https://www.kaggle.com/datasets/heeraldedhia/stop-words-in-28-languages?select=french.txt, n.d.
[11] A. Lucene, Lucene elisionfilter, https://lucene.apache.org/core/7_3_1/analyzers-common/org/apache/lucene/analysis/util/ElisionFilter.html, 2024. Accessed: 2023-04-20.
[12] A. Lucene, Lucene bm25similarity, https://lucene.apache.org/core/7_0_1/core/org/apache/lucene/search/similarities/BM25Similarity.html, 2024. Accessed: 2024-04-20.
[13] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[14] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7b, 2023. arXiv:2310.06825.
[15] A. Lucene, Lucene boostquery, https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/search/BoostQuery.html, 2024. Accessed: 2024-04-20.
[16] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.
[17] M. Günther, J. Ong, I. Mohr, A. Abdessalem, T. Abel, M. K. Akram, S. Guzman, G. Mastrapas, S. Sturua, B. Wang, M. Werk, N. Wang, H. Xiao, Jina embeddings 2: 8192-token general-purpose text embeddings for long documents, 2023. arXiv:2310.19923.
[18] M. Faysse, P. Fernandes, N. M. Guerreiro, A. Loison, D. M. Alves, C. Corro, N. Boizard, J. Alves, R. Rei, P. H. Martins, A. B. Casademunt, F. Yvon, A. F. T. Martins, G. Viaud, C. Hudelot, P. Colombo, Croissantllm: A truly bilingual french-english language model, 2024. arXiv:2402.00786.
[19] A. S. Foundation, Apache solr frenchlightstemfilter, https://solr.apache.org/guide/6_6/language-analysis.html#LanguageAnalysis-FrenchLightStemFilter, 2024. Accessed: 2024-04-20.
[20] J. Clive, Multi-purpose summarizer (fine-tuned google/flan-t5-xl on several summarization datasets), 2023. URL: https://huggingface.co/jordiclive/flan-t5-3b-summarizer. Apache 2.0 and BSD-3-Clause License. Fine-tuned on various summarization datasets including xsum, wikihow, cnn_dailymail/3.0.0, samsum, scitldr/AIC, billsum, TLDR. Designed for academic and general usage with control over summary type by varying the instruction prepended to the source document.
[21] N. Muennighoff, N. Tazi, L. Magne, N. Reimers, Mteb: Massive text embedding benchmark, arXiv preprint arXiv:2210.07316 (2022). URL: https://arxiv.org/abs/2210.07316. doi:10.48550/ARXIV.2210.07316.