<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">SEUPD@CLEF: Team Kalu on improving Search Engine Performance with Query Expansion and Re-Ranking Approach</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Kimia</forename><surname>Abedini</surname></persName>
							<email>kimia.abedini@studenti.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Akan</forename><surname>Akysh</surname></persName>
							<email>akan.akysh@studenti.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Arwa</forename><surname>Fahoud</surname></persName>
							<email>arwa.fahoud@studenti.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nicola</forename><surname>Ferro</surname></persName>
							<email>nicola.ferro@unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">SEUPD@CLEF: Team Kalu on improving Search Engine Performance with Query Expansion and Re-Ranking Approach</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">8DAD8909CA492B5770D46325D7934EAA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:02+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Information Retrieval</term>
					<term>Search Engines</term>
					<term>Retrieve Documents</term>
					<term>Query Expansion</term>
					<term>LongEval</term>
					<term>CLEF</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This report provides a detailed description of the search engine system designed by Team KALU for the Conference and Labs of the Evaluation Forum (CLEF) LongEval Lab 2024 Task 1. The team, composed of students from the University of Padua, developed this system to efficiently index, search, and retrieve documents. We begin by outlining the problem, then describe our system, which works on a collection of documents written in French, and explain the various methodologies we implemented. Finally, we present our experimental results and analyze them according to the techniques we employed.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Search engines have transformed people's access to information. These systems have proven to be fundamental tools in people's daily lives, providing a vast amount of data related to various aspects, from academic research and global news to everyday queries like shopping and local weather. However, the exponential growth of data available online presents a significant challenge for search engines in terms of storage, indexing, and retrieval. This paper addresses this issue by presenting an information retrieval system capable of adjusting to evolving data while sustaining high performance.</p><p>Our approach implements a search engine to address Task 1 of the "LongEval" lab proposed by CLEF 2024, which aims to search a corpus of documents and retrieve those most relevant to a predefined set of queries gathered from Qwant <ref type="bibr" target="#b0">[1]</ref>.</p><p>The paper is structured as follows: Section 2 reviews the related work we started from; Section 3 outlines our approach; Section 4 explains our experimental setup; Section 5 discusses our main findings; Section 6 presents a statistical analysis of our submitted runs; and Section 7 concludes with reflections and future directions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>We used the paper by the LongEval organizers <ref type="bibr" target="#b1">[2]</ref> to understand the task and the datasets provided by CLEF. This helped us understand how the documents and queries were collected and what the main objectives of the task were. The paper also provided baseline performances, which we used to benchmark our system's development.</p><p>Based on the work of the CLOSE <ref type="bibr" target="#b2">[3]</ref> and FADERIC <ref type="bibr" target="#b3">[4]</ref> teams, we chose to utilize query expansion techniques, exploring various methods, such as different Large Language Models (LLMs) and different prompts. We also built our re-ranking approach on the work of Jina AI <ref type="bibr" target="#b4">[5]</ref>, which developed the jina-reranker model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>In this section, we describe the steps we took to create the different components that comprise the search engine, along with the configurations used for each component. • ParsedDocument class represents a parsed document to be indexed. It has two attributes: ID, the unique identifier of the document, and body, the document's content. The class provides functionality to set and retrieve these attributes. • DocumentParser class is an abstract class providing basic functionality to iterate over the elements of a ParsedDocument, reading and parsing its content. • LongEvalDocumentParser class is the DocumentParser specific to the LongEval corpus. It implements a parser for documents in the TREC format: it reads each document, replaces all tags with spaces, and returns a ParsedDocument containing the ID and body of the document.</p></div>
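The parsing step can be illustrated with a minimal Python sketch (the actual implementation is a Java class hierarchy; the TREC field names DOCNO and TEXT and the helper name parse_trec are illustrative assumptions):

```python
import re

def parse_trec(text):
    """Toy TREC-format parser: yields (doc_id, body) pairs, replacing any
    remaining tags inside the body with spaces, as our parser does."""
    for block in re.findall(r"<DOC>(.*?)</DOC>", text, re.S):
        doc_id = re.search(r"<DOCNO>\s*(.*?)\s*</DOCNO>", block).group(1)
        raw = re.search(r"<TEXT>(.*?)</TEXT>", block, re.S).group(1)
        body = " ".join(re.sub(r"<[^>]+>", " ", raw).split())  # tags -> spaces
        yield doc_id, body
```

Each yielded pair corresponds to one ParsedDocument (ID plus body) handed to the indexer.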
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Analyzer</head><p>The Analyzer applies further processing steps to both the documents and the queries. It performs tokenization, stemming, and stop word filtering.</p><p>• Tokenization is the process of splitting text into tokens (each token can be thought of as a single word). We used the StandardTokenizer <ref type="bibr" target="#b5">[6]</ref>, one of the tokenizers provided by Apache Lucene. • Stemming is the process of removing suffixes or prefixes from words to obtain their root form. In this system we used the FrenchLightStemmer <ref type="bibr" target="#b6">[7]</ref>. • Stop word removal is the process of eliminating words that appear frequently in most documents and carry little meaningful information for search queries. Our stopword list includes words extracted with Luke <ref type="bibr" target="#b7">[8]</ref>, a tool provided by Lucene <ref type="bibr" target="#b8">[9]</ref>, as well as the stopword list available on Kaggle <ref type="bibr" target="#b9">[10]</ref>. Removing stopwords improved the search engine's performance by reducing the index size and, consequently, the time needed to search the index. We also used the FrenchElisionFilter <ref type="bibr" target="#b10">[11]</ref>, which addresses elision in French: certain characters, such as 'l', 'd', 's', 't', 'n', and 'm', followed by an apostrophe, are contracted by eliminating the character and its apostrophe.</p></div>
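The analysis chain can be sketched in a few lines of Python (the real system uses Lucene's Java analyzers; the stopword subset and the 2–15 length bounds here are illustrative, and stemming is omitted for brevity):

```python
import re

STOPWORDS = {"le", "la", "les", "de", "des", "du", "un", "une", "et"}  # tiny illustrative subset
ELISION = re.compile(r"^[ldstnm]'")  # l', d', s', t', n', m' followed by an apostrophe

def analyze(text):
    """Tokenize, lowercase, strip French elisions, drop stopwords, length-filter."""
    tokens = re.findall(r"[a-zàâçéèêëîïôùûüœ']+", text.lower())
    tokens = [ELISION.sub("", t) for t in tokens]          # elision filter
    return [t for t in tokens if t not in STOPWORDS and 2 <= len(t) <= 15]
```

For example, analyze("L'antivirus gratuit et les logiciels") drops the elided "l'" and the stopwords, keeping only the content-bearing tokens.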
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Indexer</head><p>Creating an index is one of the main processes: we generate a searchable database, known as an index, for the parsed documents. The index holds crucial information about the documents, such as the words and phrases they contain, their frequencies, and their locations within each document. Indexing facilitates swift document retrieval by enabling users to search based on keywords or phrases. To accomplish this task, we developed the following components:</p><p>• analyzer: the analyzer to be used, in our case LongEvalAnalyzer.</p><p>• similarity: an object used to score the relevance of a document based on the query terms it contains. It can be either Lucene's default implementation, based on a variant of the Term Frequency-Inverse Document Frequency (TF-IDF) model, or the modern alternative BM25, which we used. • ramBufferSizeMB: the size in megabytes of the RAM buffer for indexing the documents.</p><p>• indexPath: the path to the directory where the generated index should be stored.</p><p>• docsPath: the path to the documents directory.</p><p>• dpCls: an object of the DocumentParser class, responsible for parsing the documents in the collection.</p><p>We used BM25 similarity because, unlike TF-IDF, where term frequency linearly affects the score, BM25 introduces a saturation point that prevents the term frequency component from indefinitely influencing the score. BM25 scores a document based on the query terms appearing in it using the following formula:</p><formula xml:id="formula_0">\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}</formula><p>where D is the document being scored, Q is the query consisting of words q_1, q_2, \ldots, q_n, f(q_i, D) is the frequency of term q_i in document D, IDF(q_i) is the inverse document frequency of term q_i, |D| is the length of the document, and avgdl is the average document length in the collection. Finally, k_1 and b are free parameters; we left k_1 = 1.2 and b = 0.75, as in most applications.</p><p>After setting the configuration of the indexer, it does the following:</p><p>• The indexer walks through the documents directory to find documents (specifically .txt files) and processes each file for indexing. • It uses the DocumentParser class, which parses the documents and creates structured ParsedDocument objects, including document identifiers and bodies. • Each parsed document is converted into a Lucene Document object and added to the index. The document's ID and body are stored as fields within the Lucene Document.</p><p>The configurations used for the fields added to the Lucene Document are:</p><p>• IDField: the ID field stores only the document ID, without term frequencies or positions, which are unnecessary for unique identifiers. The original value of the field is stored directly in the index, so it can be retrieved when querying.</p><p>• BodyField: the body field is configured to store the terms resulting from splitting the body of the document into tokens, along with their frequencies, without storing the original text, in order to reduce the size of the index.</p></div>
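As a sanity check of the formula, BM25 can be computed with a small self-contained Python function (a toy version of what Lucene's BM25Similarity does internally; the smoothed IDF variant shown here is an assumption, chosen because it is a common form):

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl, k1=1.2, b=0.75):
    """Score one document against a query, with the same k1 and b as our runs."""
    dl = len(doc_terms)  # |D|
    score = 0.0
    for q in query_terms:
        f = doc_terms.count(q)                              # f(q_i, D)
        if f == 0:
            continue
        df = doc_freqs.get(q, 0)                            # document frequency of q_i
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return score
```

Because of the saturation term in the denominator, doubling f(q_i, D) less than doubles the score, which is exactly the behaviour that distinguishes BM25 from plain TF-IDF.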
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Searcher</head><p>The Searcher's task is to scan the indexed documents, analyze user queries, and retrieve relevant information. It then presents a ranked list of documents that satisfy the user's information need.</p><p>Our implementation does so by accepting the following parameters:</p><p>• analyzer: in this case, an instance of our Analyzer.</p><p>• similarity: we decided to use the BM25Similarity <ref type="bibr" target="#b11">[12]</ref> function for the first stage of document retrieval due to its efficiency and higher effectiveness compared to other methods. • Run options: parameters for the index path, the topics path, the run path and run name, the expected number of topics, and the maximum number of documents retrieved (in our case 1000). • Search options: parameters for query boosting, the query boosting value, the number of documents to be re-ranked, the score calculation mode, the query expansion mode, and the LLM used for generating the expansions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.1.">Query Expansion</head><p>Query expansion plays a valuable role in improving the performance of our search engine, depending on how it is used. We generate multiple expansions for each query with a Python script. The script reads the *.trec topic file and generates synonymous phrases using the Meta Llama 3 <ref type="bibr" target="#b12">[13]</ref> and Mistral-7B <ref type="bibr" target="#b13">[14]</ref> models, both of which are open source. The prompt used for generation is as follows:</p><p>Instruction: Please provide {num_expansion} synonyms in French for the given keyword that convey similar meanings. The output should be a list of words separated by commas without any further punctuation. The keyword is {word}.</p><p>Following this procedure, we slightly cleaned the generated file. Since we occasionally encountered low-quality or missing generations, we implemented a method in the search component that automatically switches to a second model if the first one fails. Sample results for the prompt are shown in Table 1.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.2.">Query Boosting</head><p>When executing a query, Lucene assigns a score to each matching document based on its relevance. Query boosting enables modification of these scores for particular documents or groups. Through experimentation, we found that mixing query expansion and Boolean queries sometimes produced poorer outcomes. However, by introducing Lucene's BoostQuery <ref type="bibr" target="#b14">[15]</ref>, we observed improvements in our evaluation metrics. We experimented with three approaches:</p><p>• Multiplying by the number of expansions we have.</p><p>• Using a fine-tuned parameter of 14.68, multiplied by the number of expansions we have. • 14.68 × (num_expansion + total_expansion/num_expansion − 1), where num_expansion is the number of queries we selected in query expansion and total_expansion is the total number of queries we expanded.
Our reasoning for this approach was to prioritize exact terms when a word has many meanings. However, our findings suggest that this strategy does not produce the anticipated results.</p><p>To summarize, we settled on the second approach, using a Boolean query in which we added at most three expansions with the SHOULD clause and boosted the main query with MUST.</p></div>
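The three boost variants can be written out as a small helper (the function and mode names are hypothetical; only the 14.68 constant and the formulas come from our experiments):

```python
def boost_value(mode, num_expansion, total_expansion, tuned=14.68):
    """Return the boost applied to the main query under each strategy we tried."""
    if mode == "count":   # multiply by the number of expansions
        return float(num_expansion)
    if mode == "tuned":   # fine-tuned constant times the number of expansions
        return tuned * num_expansion
    if mode == "ratio":   # 14.68 * (num_expansion + total_expansion / num_expansion - 1)
        return tuned * (num_expansion + total_expansion / num_expansion - 1)
    raise ValueError(mode)
```

The resulting value would be passed to Lucene's BoostQuery wrapping the main query clause.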
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.3.">Document Re-Ranking</head><p>Document re-ranking is the process of taking the initially ranked list of documents (or items) and re-evaluating their relevance or importance based on new information, constraints, or preferences. Our approaches to document re-ranking are:</p><p>• Secondary ranking function: apply a secondary ranking function that considers additional criteria or constraints. • Score adjustment: modify the scores of individual documents based on another score.</p><p>For the ranking function, we considered two different approaches: using Bidirectional Encoder Representations from Transformers (BERT) models and using LLMs. In both cases, the objective is to compute the embeddings of the words and determine the cosine similarity between the query and the document.</p><p>After some research, we attempted to find a fast SBERT <ref type="bibr" target="#b15">[16]</ref> model and a well-tuned LLM to assess performance. We opted for jina-reranker-v1-turbo-en, designed for rapid reranking while maintaining competitive performance, and built on the JinaBERT <ref type="bibr" target="#b16">[17]</ref> model. For the LLM, we chose sentence-croissant-llm-base, engineered to produce French text embeddings; it was fine-tuned from the recently pre-trained LLM croissantllm/CroissantLLMBase <ref type="bibr" target="#b17">[18]</ref>.</p><p>In the end, we found that employing LLMs for re-ranking is computationally expensive and produces nearly identical results.
Consequently, we decided to use jina-reranker-v1-turbo-en, re-ranking the first 200 documents and leaving the rest unchanged.</p><p>For score adjustment we used two modes:</p><p>• Simple mode: change the score of the document directly based on the secondary ranking function.</p><p>• Harmonic mode: combine the BM25 score with the secondary ranking function's score using the harmonic mean:</p><formula xml:id="formula_1">H = \frac{2}{\frac{1}{x_1} + \frac{1}{x_2}}</formula><p>Based on the results, the harmonic mode performs better.</p></div>
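The two score-adjustment modes can be sketched as follows (combine_scores is a hypothetical name; x1 and x2 in the harmonic-mean formula correspond to the BM25 and re-ranker scores):

```python
def combine_scores(bm25, rerank, mode="harmonic"):
    """Combine the first-stage BM25 score with the secondary (re-ranker) score."""
    if mode == "simple":
        return rerank                          # take the re-ranker score directly
    return 2.0 / (1.0 / bm25 + 1.0 / rerank)   # harmonic mean H = 2 / (1/x1 + 1/x2)
```

The harmonic mean is dominated by the smaller of the two inputs, so a document must score well under both BM25 and the re-ranker to keep a high final score, which may explain why this mode performed better.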
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Setup</head><p>The experimental setup for our Information Retrieval (IR) system uses the LongEval collection, the official training collection for the 2024 LongEval IR Lab (https://clef-longeval.github.io/). The collection contains French-language web pages and queries, along with their English translations. We used the French data for our experiments.</p><p>To assess the performance of our IR system, we used the trec_eval executable to evaluate the results under various configurations. We monitored improvements in the following evaluation metrics produced by trec_eval:</p><p>• num_ret: number of documents retrieved for each query.</p><p>• num_rel: number of relevant documents for each query.</p><p>• num_rel_ret: number of relevant documents retrieved for each query.</p><p>• map: Mean Average Precision, indicating the average relevance of retrieved documents across all queries. • rprec: R-Precision, calculated at the rank corresponding to the number of relevant documents for each query. • p@5 &amp; p@10: precision scores computed at the top 5 and top 10 retrieved documents for each query. • nDCG: Normalized Discounted Cumulative Gain, a metric evaluating ranked lists by considering item relevance.</p><p>Our project's Git repository is publicly available at https://bitbucket.org/upd-dei-stud-prj/seupd2324-kalu/src/master, and the code is openly accessible for replication. We used a MacBook Pro with an M2 Max chip (12-core CPU, 30-core GPU, 32 GB RAM) to compute our runs.</p></div>
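For intuition about two of the metrics reported by trec_eval, precision at k and average precision can be computed as follows (a simplified sketch; trec_eval itself handles details such as run trimming and unjudged documents):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the ranks where relevant documents appear."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0
```

MAP, the measure we optimized for, is simply the mean of average_precision over all queries in the topic set.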
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Discussion</head><p>In this section, we present some of the most significant results obtained during the development phase. We consider five primary milestones that, after multiple trials, substantially improved our Mean Average Precision (MAP) score and the overall number of relevant documents retrieved. Several models were evaluated, focusing on re-ranking and query expansion techniques. Initially, we found that using the FrenchLightStemFilter <ref type="bibr" target="#b18">[19]</ref> as the stemmer and adjusting the length filter from 2 to 15 (reflecting the tendency of French to have longer words) yielded very positive results <ref type="bibr" target="#b2">[3]</ref>. We then introduced four models: a base model; re-ranking 100 documents with the simple score combination mode; re-ranking 100 documents with the simple mode using Mistral query expansion with a threshold of three words; and re-ranking 100 documents with the simple mode using Llama query expansion with three words. The third model, utilizing Mistral query expansion, achieved the highest MAP of 0.044, surpassing the base model. Subsequently, two more models were introduced to compare score combination modes and the handling of empty expansion cases: by handling empty cases when necessary, utilizing stopwords, and using the harmonic score combination, we achieved a higher MAP of 0.0487 compared to the base model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Results on training data</head><p>To sum up, the most successful approaches involved re-ranking with Mistral query expansion (with Llama 3 as fallback), a threshold of three, and the inclusion of stopwords; among these methods, the harmonic mean performed best.</p><p>Additional models were tested, including summarizing texts with an LLM and integrating the summaries into the original texts before indexing, or completely replacing the original text with a summary generated by flan-t5-3b-summarizer <ref type="bibr" target="#b19">[20]</ref>. However, these approaches provided results similar to the simpler methods and required significantly more time to execute. Furthermore, various boosting methods were explored, but most decreased the MAP on the training dataset (see Section 3.4.2). We also considered discarding the use of an LLM for re-ranking due to its poor performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Results on Test data</head><p>In this section, we provide the results obtained by running our systems on the two available test collections, short-term and long-term, reported in Table 4 and Table 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Statistical Analysis</head><p>In this section, we conduct a statistical analysis of the retrieval effectiveness of our five runs submitted to CLEF. This evaluation aims to assess each run's performance and determine how well the system retrieves and ranks relevant documents.</p><p>We compared the Normalized Discounted Cumulative Gain (nDCG) and Mean Average Precision (MAP) of each run to understand the performance differences among them, considering both short-term and long-term evaluations. The analysis uses box plots, two-way ANOVA, and the Tukey HSD test.</p><p>We first used box plots, which represent a distribution of data concisely. We then applied two-way Analysis of Variance (ANOVA) tests to explore the differences observed in both short-term and long-term evaluations. In addition, we used the Tukey Honest Significant Difference (HSD) test, a post-hoc analysis for ANOVA that compares group means while controlling for multiple comparisons, to ensure reliable identification of significant differences.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Box Plot</head><p>Box plots are graphical tools that represent a distribution of data concisely. In our case, we plot the distribution of the scores achieved by our submitted systems on each query of the different test sets, with respect to nDCG and MAP. Analysing the nDCG performance of all runs on the short-term set, we observe that run1 achieves lower nDCG scores, indicating its inferior effectiveness in capturing and ranking relevant documents, while the other four runs exhibit approximately similar levels of performance. The same nDCG pattern holds on the long-term set. From the box plots, we can also observe the distribution of MAP scores for each run. Analysing the MAP performance of all runs on the short-term and long-term sets, we observe that run1 has the lowest MAP scores, indicating its inferior accuracy; run2 and run3 have approximately the same MAP scores in the short- and long-term evaluations, while run4 and run5 have the highest MAP scores, with a slight difference from run2 and run3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Two-way ANOVA</head><p>In a two-way ANOVA test, we check whether the factors Topic and System influence the results, testing on both the MAP and nDCG measures. From the results of the two-way ANOVA tests, we can conclude that both factors (System and Topic) are important in influencing the performance measures (nDCG and MAP) in both short-term and long-term evaluations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions and Future Work</head><p>In this work, we presented our approach to the CLEF LongEval Lab 2024 task, which aimed to develop an effective and efficient search engine for web documents. Our approach combined different techniques, including query expansion, re-ranking, and the use of large language models for different purposes. Our experiments showed good results for our approach, with better effectiveness and efficiency than the baseline system provided by CLEF. Combining two scores in the re-ranking phase also improved retrieval performance. We found several areas in which our approach could be improved further.</p><p>One promising direction is to use text summarization and title extraction techniques in the parsing step. While we experimented with this approach, it did not generate significant improvements and raised efficiency concerns. However, we believe that refining this technique or exploring alternative approaches could lead to better results.</p><p>Another idea is to embed documents using different methods <ref type="bibr" target="#b20">[21]</ref> for re-ranking, or to embed text chunks and their summaries: chunking text documents into small pieces is an interesting technique that can increase the accuracy and quality of the system and could help capture nuanced semantic relationships between documents. Additionally, including context-awareness when calling LLMs to generate synonyms might have a positive impact on the overall retrieval performance. Furthermore, we could fine-tune our re-ranker with SBERT using the training data and implement a custom re-ranker specific to this task.
By leveraging the strengths of different models and techniques, we hope to achieve even better results and push the boundaries of what is possible in LongEval information retrieval.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Workflow of the IR system implemented by KALU.</figDesc><graphic coords="2,117.13,128.96,361.02,221.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>3.1.</head><label>1</label><figDesc>Parser: the Parser processes the collection of documents, extracting valuable information and filtering out irrelevant data. We also use the parser to extract the ID and body of the documents. Our parser consists of three classes: ParsedDocument, DocumentParser, and LongEvalDocumentParser.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Document structure without parsing.</figDesc><graphic coords="2,207.38,478.59,180.50,219.16" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Analyzer Process.</figDesc><graphic coords="3,117.13,421.60,361.00,199.03" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Standard Recall Levels vs Interpolated Precision</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: box plot for both the short-term and long-term set runs (nDCG performance)</figDesc><graphic coords="11,72.00,65.61,221.13,177.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head></head><label></label><figDesc>(a) long-term set runs (b) short-term set runs</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: box plot for both the short-term and long-term set runs (Map performance)</figDesc><graphic coords="11,72.00,366.56,221.13,177.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Tukey's HSD test for all the five runs in long-term evaluation</figDesc><graphic coords="12,72.00,544.04,221.13,165.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Tukey's HSD test for all the five runs in short-term evaluation</figDesc><graphic coords="13,72.00,65.61,221.13,165.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Sample results for query expansion using the given prompt. Query Boosting is a technique used to adjust the score of documents retrieved by a search query, allowing for customization of document relevance based on specific criteria.</figDesc><table><row><cell>Query</cell><cell>Meta Llama 3</cell><cell>Mistral-7B</cell></row><row><cell>anti-virus gratuit</cell><cell>1. programmes antivirus libres 2. solutions antivirus gratuites.</cell><cell>1. logiciel antivirus gratuit 2. solution antivirus gratuit</cell></row><row><cell>bardeau</cell><cell>1. clôture 2. écran 3. barrage</cell><cell>1. plaque de bois 2. tableau 3. pannée</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Parameters used in the 5 different runs submitted to CLEF</figDesc><table><row><cell>Parameter</cell><cell>Run 1</cell><cell>Run 2</cell><cell>Run 3</cell><cell>Run 4</cell><cell>Run 5</cell></row><row><cell>Token Filter</cell><cell>Porter</cell><cell>FrenchLight</cell><cell>FrenchLight</cell><cell>FrenchLight</cell><cell>FrenchLight</cell></row><row><cell>Tokenizer</cell><cell>Standard</cell><cell>Standard</cell><cell>Standard</cell><cell>Standard</cell><cell>Standard</cell></row><row><cell>Length Filter</cell><cell>2-15</cell><cell>2-15</cell><cell>2-15</cell><cell>2-15</cell><cell>2-15</cell></row><row><cell>Stop Filter</cell><cell>"None"</cell><cell>"stoplist-fr"</cell><cell>"stoplist-fr"</cell><cell>"stoplist-fr"</cell><cell>"stoplist-fr"</cell></row><row><cell>Lower Case Filter</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell></row><row><cell>Similarity</cell><cell>BM25</cell><cell>BM25</cell><cell>BM25</cell><cell>BM25</cell><cell>BM25</cell></row><row><cell>Query Expansion</cell><cell>No</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell></row><row><cell>Query Expansion Model</cell><cell>-</cell><cell>Llama 3</cell><cell>Mistral*</cell><cell>Mistral*</cell><cell>Mistral*</cell></row><row><cell>Boolean Clause Main Query Mode</cell><cell>"SHOULD"</cell><cell>"SHOULD"</cell><cell>"SHOULD"</cell><cell>"SHOULD"</cell><cell>"MUST"</cell></row><row><cell>Re-ranking</cell><cell>No</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell></row><row><cell>Score Combination Mode</cell><cell>-</cell><cell>Simple</cell><cell>Simple</cell><cell>Harmonic</cell><cell>Harmonic</cell></row><row><cell>Num. of Re-ranked Documents</cell><cell>-</cell><cell>100</cell><cell>100</cell><cell>200</cell><cell>200</cell></row></table><note>* If the model failed, it would switch to another one.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Results for systems (top-1000 documents), on the French Train collection and the train query set of LongEval.</figDesc><table><row><cell>metrics</cell><cell>run1</cell><cell>run2</cell><cell>run3</cell><cell>run4</cell><cell>run5</cell></row><row><cell>num_q</cell><cell>597</cell><cell>599</cell><cell>599</cell><cell>599</cell><cell>597</cell></row><row><cell>num_ret</cell><cell cols="5">584902 598281 599000 599000 587943</cell></row><row><cell>num_rel</cell><cell>4344</cell><cell>4362</cell><cell>4362</cell><cell>4362</cell><cell>4350</cell></row><row><cell>num_rel_ret</cell><cell>3542</cell><cell>3553</cell><cell>3583</cell><cell>3583</cell><cell>3578</cell></row><row><cell>map</cell><cell cols="5">0.1853 0.2286 0.2313 0.2366 0.2374</cell></row><row><cell>gm_map</cell><cell cols="5">0.0484 0.0659 0.0744 0.0817 0.0841</cell></row><row><cell>Rprec</cell><cell cols="3">0.1756 0.2271 0.2281</cell><cell>0.23</cell><cell>0.2308</cell></row><row><cell>recall_10</cell><cell>0.231</cell><cell cols="4">0.2855 0.2857 0.2915 0.2925</cell></row><row><cell>recall_100</cell><cell>0.5666</cell><cell>0.574</cell><cell cols="3">0.5826 0.6174 0.6192</cell></row><row><cell>recall_1000</cell><cell>0.8123</cell><cell>0.812</cell><cell>0.821</cell><cell>0.821</cell><cell>0.8225</cell></row><row><cell>ndcg</cell><cell cols="5">0.3692 0.4065 0.4097 0.4161 0.4173</cell></row><row><cell>ndcg_rel</cell><cell cols="2">0.2875 0.3274</cell><cell>0.329</cell><cell cols="2">0.3344 0.3354</cell></row><row><cell>Rndcg</cell><cell>0.2243</cell><cell>0.269</cell><cell cols="3">0.2701 0.2745 0.2754</cell></row><row><cell>ndcg_cut_10</cell><cell cols="4">0.1954 0.2464 0.2458 0.2511</cell><cell>0.252</cell></row><row><cell>ndcg_cut_100</cell><cell cols="4">0.3162 0.3556 0.3591 0.3727</cell><cell>0.374</cell></row><row><cell>ndcg_cut_1000</cell><cell cols="5">0.3692 0.4065 0.4097 0.4161 0.4173</cell></row><row><cell>map_cut_10</cell><cell cols="5">0.1271 0.1691 0.1702 0.1729 0.1735</cell></row><row><cell>map_cut_100</cell><cell cols="5">0.1807 0.2244 0.2273 0.2331 0.2339</cell></row><row><cell>map_cut_1000</cell><cell cols="5">0.1853 0.2286 0.2313 0.2366 0.2374</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Scores of all five systems on the short-term test collection.</figDesc><table><row><cell>Evaluation measure</cell><cell>run1</cell><cell>run2</cell><cell>run3</cell><cell>run4</cell><cell>run5</cell></row><row><cell>MAP</cell><cell>0.1578</cell><cell>0.1855</cell><cell>0.1875</cell><cell>0.1922</cell><cell>0.1922</cell></row><row><cell>nDCG</cell><cell>0.2984</cell><cell>0.3225</cell><cell>0.3240</cell><cell>0.3302</cell><cell>0.3302</cell></row><row><cell>nDCG@10</cell><cell>0.1886</cell><cell>0.2247</cell><cell>0.2264</cell><cell>0.2297</cell><cell>0.2297</cell></row><row><cell>P@10</cell><cell>0.1472</cell><cell>0.1745</cell><cell>0.1752</cell><cell>0.1789</cell><cell>0.1789</cell></row><row><cell>Recall</cell><cell>0.5884</cell><cell>0.5867</cell><cell>0.5884</cell><cell>0.5884</cell><cell>0.5884</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Scores of all five systems on the long-term test collection.</figDesc><table><row><cell>Evaluation measure</cell><cell>run1</cell><cell>run2</cell><cell>run3</cell><cell>run4</cell><cell>run5</cell></row><row><cell>MAP</cell><cell>0.1067</cell><cell>0.1400</cell><cell>0.1395</cell><cell>0.1430</cell><cell>0.1434</cell></row><row><cell>nDCG</cell><cell>0.2193</cell><cell>0.2502</cell><cell>0.2494</cell><cell>0.2535</cell><cell>0.2542</cell></row><row><cell>nDCG@10</cell><cell>0.1479</cell><cell>0.1921</cell><cell>0.1912</cell><cell>0.1931</cell><cell>0.1936</cell></row><row><cell>P@10</cell><cell>0.1145</cell><cell>0.1407</cell><cell>0.1397</cell><cell>0.1413</cell><cell>0.1417</cell></row><row><cell>Recall</cell><cell>0.4142</cell><cell>0.4136</cell><cell>0.4131</cell><cell>0.4131</cell><cell>0.4142</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 6</head><label>6</label><figDesc>Recap of our runs submitted to CLEF.</figDesc><table><row><cell>runs</cell><cell>language</cell><cell>type</cell></row><row><cell>run1</cell><cell>French</cell><cell>Base</cell></row><row><cell>run2</cell><cell>French</cell><cell>Query Expansion using Llama 3</cell></row><row><cell>run3</cell><cell>French</cell><cell>Re-ranking simple mode</cell></row><row><cell>run4</cell><cell>French</cell><cell>Harmonic Re-ranking using Should</cell></row><row><cell>run5</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 7</head><label>7</label><figDesc>Two-way ANOVA results for short-term nDCG</figDesc><table><row><cell>Source</cell><cell>df</cell><cell>SS</cell><cell>MS</cell><cell>F</cell><cell>PR(&gt;F)</cell></row><row><cell>Columns(Systems)</cell><cell>4.0</cell><cell>0.2785</cell><cell>0.0696</cell><cell>21.3414</cell><cell>3.594524e-17</cell></row><row><cell>Rows(Topics)</cell><cell>403.0</cell><cell>82.8188</cell><cell>0.2055</cell><cell>62.9957</cell><cell>0</cell></row><row><cell>Error</cell><cell>1612.0</cell><cell>5.2587</cell><cell>0.0033</cell><cell>-</cell><cell>-</cell></row><row><cell>Total</cell><cell>2019.0</cell><cell>88.3559</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell cols="6">Table 8. Two-way ANOVA results for short-term MAP</cell></row><row><cell>Source</cell><cell>df</cell><cell>SS</cell><cell>MS</cell><cell>F</cell><cell>PR(&gt;F)</cell></row><row><cell>Columns(Systems)</cell><cell>4.0</cell><cell>0.3358</cell><cell>0.0840</cell><cell>24.6253</cell><cell>8.216775e-20</cell></row><row><cell>Rows(Topics)</cell><cell>403.0</cell><cell>67.4513</cell><cell>0.1674</cell><cell>49.0915</cell><cell>0</cell></row><row><cell>Error</cell><cell>1612.0</cell><cell>5.4960</cell><cell>0.0034</cell><cell>-</cell><cell>-</cell></row><row><cell>Total</cell><cell>2019.0</cell><cell>73.2831</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell cols="6">Table 9. Two-way ANOVA results for long-term nDCG</cell></row><row><cell>Source</cell><cell>df</cell><cell>SS</cell><cell>MS</cell><cell>F</cell><cell>PR(&gt;F)</cell></row><row><cell>Columns(Systems)</cell><cell>4.0</cell><cell>1.2535</cell><cell>0.3134</cell><cell>11.8346</cell><cell>1.4110e-09</cell></row><row><cell>Rows(Topics)</cell><cell>1511.0</cell><cell>103.3381</cell><cell>0.0684</cell><cell>2.5829</cell><cell>1.8578e-142</cell></row><row><cell>Error</cell><cell>6044.0</cell><cell>160.0363</cell><cell>0.0265</cell><cell>-</cell><cell>-</cell></row><row><cell>Total</cell><cell>7559.0</cell><cell>264.6278</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell cols="6">Table 10. Two-way ANOVA results for long-term MAP</cell></row><row><cell>Source</cell><cell>df</cell><cell>SS</cell><cell>MS</cell><cell>F</cell><cell>PR(&gt;F)</cell></row><row><cell>Columns(Systems)</cell><cell>4.0</cell><cell>1.4099</cell><cell>0.3525</cell><cell>19.2277</cell><cell>9.8898e-16</cell></row><row><cell>Rows(Topics)</cell><cell>1511.0</cell><cell>71.7684</cell><cell>0.0475</cell><cell>2.5909</cell><cell>1.9168e-143</cell></row><row><cell>Error</cell><cell>6044.0</cell><cell>110.7997</cell><cell>0.0183</cell><cell>-</cell><cell>-</cell></row><row><cell>Total</cell><cell>7559.0</cell><cell>183.9780</cell><cell>-</cell><cell>-</cell><cell>-</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<orgName>Qwant</orgName>
		</author>
		<ptr target="https://about.qwant.com/en/" />
		<title level="m">About Qwant</title>
				<imprint>
			<date type="published" when="2023-05-20">2023-05-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Galuščáková</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Deveaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gonzalez-Saez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mulhem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Piroi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Popel</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.03229</idno>
		<title level="m">LongEval-Retrieval: French-English dynamic test collection for continuous web search evaluation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">SEUPD@CLEF: Team CLOSE on temporal persistence of IR systems&apos; performance</title>
		<author>
			<persName><forename type="first">G</forename><surname>Antolini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Boscolo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cazzaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Martinelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Safavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Shami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR WORKSHOP PROCEEDINGS</title>
				<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">3497</biblScope>
			<biblScope unit="page" from="2368" to="2395" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">SEUPD@CLEF: Team FADERIC on a query expansion and reranking approach for the LongEval task</title>
		<author>
			<persName><forename type="first">E</forename><surname>Bolzonello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Marchiori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Moschetta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Trevisiol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Zanini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR WORKSHOP PROCEEDINGS</title>
				<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">3497</biblScope>
			<biblScope unit="page" from="2252" to="2280" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Günther</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Mohr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Abdessalem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Abel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Akram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Guzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mastrapas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sturua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Werk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xiao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.19923</idno>
		<title level="m">Jina embeddings 2: 8192-token general-purpose text embeddings for long documents</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<orgName>Apache Lucene</orgName>
		</author>
		<ptr target="https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html" />
		<title level="m">StandardTokenizer</title>
				<imprint>
			<date type="published" when="2024-05-20">2024-05-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<orgName>Apache Lucene</orgName>
		</author>
		<ptr target="https://lucene.apache.org/core/6_2_0/analyzers-common/org/apache/lucene/analysis/fr/FrenchLightStemmer.html" />
		<title level="m">FrenchLightStemmer</title>
				<imprint>
			<date type="published" when="2024-04-20">2024-04-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<orgName>Apache Lucene</orgName>
		</author>
		<ptr target="https://lucene.apache.org/core/8_11_0/luke/index.html" />
		<title level="m">Luke</title>
				<imprint>
			<date type="published" when="2024-05-20">2024-05-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<orgName>Apache Software Foundation</orgName>
		</author>
		<ptr target="https://lucene.apache.org/" />
		<title level="m">Apache Lucene</title>
				<imprint>
			<date type="published" when="2023-05-20">2023-05-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<orgName>Kaggle</orgName>
		</author>
		<ptr target="https://www.kaggle.com/datasets/heeraldedhia/stop-words-in-28-languages?select=french.txt" />
		<title level="m">Stop words in 28 languages: French stop list</title>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<orgName>Apache Lucene</orgName>
		</author>
		<ptr target="https://lucene.apache.org/core/7_3_1/analyzers-common/org/apache/lucene/analysis/util/ElisionFilter.html" />
		<title level="m">Lucene ElisionFilter</title>
				<imprint>
			<date type="published" when="2023-04-20">2023-04-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<orgName>Apache Lucene</orgName>
		</author>
		<ptr target="https://lucene.apache.org/core/7_0_1/core/org/apache/lucene/search/similarities/BM25Similarity.html" />
		<title level="m">Lucene BM25Similarity</title>
				<imprint>
			<date type="published" when="2024-04-20">2024-04-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<ptr target="https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md" />
		<title level="m">Llama 3 model card</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>AI@Meta</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title/>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Q</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sablayrolles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mensch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bamford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Chaplot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>De Las Casas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bressand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lengyel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Saulnier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">R</forename><surname>Lavaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Stock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">L</forename><surname>Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lacroix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">E</forename><surname>Sayed</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.06825</idno>
	</analytic>
	<monogr>
		<title level="m">Mistral 7B</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<orgName>Apache Lucene</orgName>
		</author>
		<ptr target="https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/search/BoostQuery.html" />
		<title level="m">Lucene BoostQuery</title>
				<imprint>
			<date type="published" when="2024-04-20">2024-04-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Sentence-BERT: Sentence embeddings using Siamese BERT-networks</title>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1908.10084" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Günther</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Mohr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Abdessalem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Abel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Akram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Guzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mastrapas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sturua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Werk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xiao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.19923</idno>
		<title level="m">Jina embeddings 2: 8192-token general-purpose text embeddings for long documents</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Faysse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fernandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">M</forename><surname>Guerreiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Loison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Alves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Corro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Boizard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Alves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">H</forename><surname>Martins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Casademunt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Yvon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F T</forename><surname>Martins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Viaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hudelot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Colombo</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.00786</idno>
		<title level="m">CroissantLLM: A truly bilingual French-English language model</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<orgName>Apache Software Foundation</orgName>
		</author>
		<ptr target="https://solr.apache.org/guide/6_6/language-analysis.html#LanguageAnalysis-FrenchLightStemFilter" />
		<title level="m">Apache Solr FrenchLightStemFilter</title>
				<imprint>
			<date type="published" when="2024-04-20">2024-04-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Clive</surname></persName>
		</author>
		<ptr target="https://huggingface.co/jordiclive/flan-t5-3b-summarizer" />
		<title level="m">Multi-purpose summarizer (fine-tuned google/flan-t5-xl on several summarization datasets)</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>Apache 2.0 and BSD-3-Clause License. Fine-tuned on various summarization datasets including xsum, wikihow, cnn_dailymail/3.0.0, samsum, scitldr/AIC, billsum, TLDR. Designed for academic and general usage with control over summary type by varying the instruction prepended to the source document</note>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Muennighoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tazi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Magne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2210.07316</idno>
		<idno type="arXiv">arXiv:2210.07316</idno>
		<ptr target="https://arxiv.org/abs/2210.07316" />
		<title level="m">MTEB: Massive text embedding benchmark</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
