<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SEUPD@CLEF: Team MOUSE on Enhancing Search Engines Effectiveness with Large Language Models.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo Cazzador</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Luigi De Faveri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filippo Franceschini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Pamio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samuel Piron</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The deterioration of the performance of Information Retrieval Systems (IRSs) over time remains an open issue in the Information Retrieval (IR) community. In this study, carried out for Task 1 of the Longitudinal Evaluation of Model Performance Lab (LongEval) at the Conference and Labs of the Evaluation Forum (CLEF) 2024, we propose an IRS designed to handle changes in the data over time and analyze its performance. The system uses different Large Language Models (LLMs) to enhance the effectiveness of the retrieval process by rephrasing the queries before the search and by reranking the retrieved documents. In an in-depth analysis of the MOUSE group retrieval system on the datasets provided by the CLEF organisers, the proposed system reached a Mean Average Precision (MAP) of 0.22 and a Normalized Discounted Cumulated Gain (nDCG) of 0.40 on the English collection; on the original French collection, performance increased to 0.31 and 0.50 for MAP and nDCG, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>Search Engines</kwd>
        <kwd>Query Expansion</kwd>
        <kwd>Reranking</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The paper is organized as follows: Section 2 briefly introduces some related works for past LongEval
tasks at CLEF 23; Section 3 describes our approach; Section 4 explains our experimental setup; Section 5
discusses our main findings; finally, Section 6 draws some conclusions and outlooks for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>Section 2 describes related works used during the implementation of the MOUSE system. Subsection 2.1
proposes a walk-through of two of the studies that have faced the problem of Longitudinal Evaluation
of IRSs at the CLEF 23 LongEval Laboratory.</p>
      <p>
        The implementation of our IRS uses as base code the one proposed during the Search Engines course
in the academic year 2023/24. The classes implemented in Java provided during the course were
ParsedDocument, DocumentParser, TipsterParser, DirectoryIndexer, BodyField, TopicsReader, Searcher, and
they were used for the Tipster collection [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Our implementation uses these classes as a starting point
and is described in depth in Section 3.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Past LongEval research</title>
        <p>
          Prior research in the field of IR has explored different methods for evaluating the performance of an
IRS under changes over time. In their study, Antolini et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] presented a document-processing pipeline that removes irrelevant scripts and advertisements and then
analyzes the text with techniques such as stemming and stopword filtering. The documents are
then indexed for efficient retrieval. As in our study, the queries are expanded before the searching step to enhance performance, in their case using the ChatGPT 3.5
Turbo [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] model. We adopted a similar approach
but changed the model used for the query expansion process, adopting open-source models that perform better
than ChatGPT 3.5 Turbo<sup>1</sup>.
        </p>
        <p>
          In contrast, Bolzonello et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] used pre-trained Language Models (LMs) to rerank the results obtained
by searching documents with BM25. The authors used a T5-base [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and a BERT-base [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] model,
achieving improvements of 3.5% and 8.5% on MAP and nDCG, respectively. By including
LLMs in our query expansion process, we were able to surpass the results of Bolzonello et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] without
requiring any additional reranking.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this section, we describe the methodology adopted, with a focus on the architecture of
our system, implemented using the Apache Lucene Java library along with Python code for the
query expansion step. Figure 1 shows the traditional Apache Lucene system.</p>
      <p>We also provide a general overview of the implemented workflow in Figure 2. The main phases of the process are:</p>
      <p>• Parsing and Analyzing: the text of queries and documents is sanitized to extract
relevant information and remove noise. The raw input queries and documents are tokenized,
stemmed and filtered in order to remove useless code, HTML tags and special characters, such as
emoji.</p>
      <p>• Indexing: each document undergoes an indexing process, retaining only essential information.
Indexed documents contain an ID field with the document’s identifier in the collection and a
content field comprising the entire body of the document.</p>
      <p>• Query Expansion: the queries are reformulated with an LLM to broaden the scope of the user query.
By supplementing the original query terms with additional, contextually relevant terms, the
effectiveness of the topic retrieval capability is increased.</p>
      <p>[Figure 1: the traditional Apache Lucene system. Components: Query, Indexing, Query Representation; Documents, Indexing, Documents Representation; Matching; Retrieved Documents.]</p>
      <sec id="sec-3-11">
        <title>Representation</title>
        <p>• Reranking: sorting the retrieved document list according to new scores provided by a
Transformer-based model. By rearranging the retrieved documents, the reranking phase enhances
the user’s search, presenting more relevant results at the top of the list.</p>
        <sec id="sec-3-11-1">
          <title>3.1. Parser</title>
          <p>First, we downloaded the data from the LongEval data website<sup>2</sup>. Then, before starting the system
implementation, we inspected the documents provided by the CLEF LongEval 2024 organizers. During this
initial phase, we analyzed the data thoroughly to understand how to
optimize the parsing of documents and queries. This was a critical step in ensuring the effectiveness of
the search process and improving the overall performance.</p>
          <p>The first phase of the workflow consists of parsing the documents in the collection; at the
end of the parsing process, unnecessary noise has been removed from each document. Three main Java classes
were implemented to support this process:</p>
          <p>• DocumentParser: this Java class reads different types of documents and works
with their content.</p>
          <p>• MouseParserJson: this Java class is a parser for the documents supplied in JSON format
by the CLEF LongEval 2024 organizers.</p>
        </sec>
      </sec>
      <sec id="sec-3-12">
        <title>Figure 2</title>
        <p>[Figure 2: overview of the implemented workflow. Components: Information need, Query, Query Expansion, Indexing, Query Representation, Matching, Retrieved Documents, Reranking (PyGaggle), Reranked Documents.]</p>
      </sec>
      <sec id="sec-3-23">
        <title>Documents</title>
        <p>• ParsedDocument: this Java class represents the parsed document ready to be indexed. The
class collects two fields: the id and the body, i.e., the document’s content comprising the text to
be parsed.</p>
        <p>After implementing these classes, we executed them and stored the resulting data
for the subsequent retrieval phase, which encompasses the analyzer step. During this phase, the
MouseParserJson class was used to iterate over the input documents, while the ParsedDocument class
was used to represent each document to be indexed.</p>
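        <p>The parsing step described above can be illustrated with a minimal Python sketch. It is not the Java implementation of the MOUSE system: the function name parse_json_documents and the JSON keys "id" and "contents" are assumptions made for this illustration, and the only clean-up shown is whitespace normalization.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import json
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class ParsedDocument:
    """A parsed document ready for indexing: an identifier plus the body text."""
    id: str
    body: str


def parse_json_documents(raw: str, id_key: str = "id",
                         body_key: str = "contents") -> Iterator[ParsedDocument]:
    """Yield one ParsedDocument per JSON object, skipping incomplete records.

    The key names are hypothetical defaults, not taken from the paper.
    """
    for record in json.loads(raw):
        doc_id = record.get(id_key)
        body = record.get(body_key)
        if doc_id is None or not body:
            continue  # incomplete record: nothing to index
        # Collapse runs of whitespace (including newlines) into single spaces.
        yield ParsedDocument(id=str(doc_id), body=" ".join(str(body).split()))


sample = '[{"id": "doc1", "contents": "Espace   Schengen\\n visa"}, {"id": "doc2"}]'
docs = list(parse_json_documents(sample))
```
]]></preformat>
        <p>In this sketch the second record is skipped because it has no body, mirroring the idea that only documents with usable content reach the indexer.</p>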
        <sec id="sec-3-23-1">
          <title>3.2. Analyzer</title>
          <p>After the parsing phase, the text of queries and documents must be processed and manipulated.
We first tested the default Lucene Analyzer<sup>3</sup>, and then we implemented a custom version of it:
the MouseAnalyzer. This class is used to manipulate words before the retrieval process.</p>
          <p>Before explaining the analyzer, we briefly describe the class MouseParams used by the analyzer.
This class helps initialize the parameters the analyzer uses. Furthermore, we defined a TokenizerType
attribute to specify which kind of tokenizer the system should use:</p>
          <p>• Whitespace: the WhitespaceTokenizer is a tokenizer from the Lucene library<sup>4</sup> that divides text
at whitespace characters, as defined by the method isWhitespace<sup>5</sup>.</p>
          <p>• Letter: the LetterTokenizer is a tokenizer from the Lucene library<sup>6</sup> that divides text at non-letters.</p>
          <p>• Standard: the StandardTokenizer is a tokenizer that divides text according to the Word Break rules
from the Unicode Text Segmentation algorithm [10], which considers numbers and acronyms as
separate tokens.</p>
          <p>3 https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/analysis/Analyzer.html</p>
          <p>4 https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/WhitespaceTokenizer.html</p>
          <p>5 https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html?is-external=true#isWhitespace-int</p>
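          <p>The difference between the whitespace and letter strategies can be seen on a short French string. The following Python sketch only approximates the behaviour of the corresponding Lucene tokenizers (Python and Java do not classify every whitespace or letter character identically), and it does not attempt to reproduce the StandardTokenizer; the function names are chosen for this illustration.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import re


def whitespace_tokenize(text):
    """Split at runs of whitespace, keeping punctuation and digits inside tokens."""
    return text.split()


def letter_tokenize(text):
    """Keep maximal runs of letters; anything that is not a letter is a separator."""
    # [^\W\d_] matches a word character that is neither a digit nor an underscore,
    # i.e. a Unicode letter.
    return re.findall(r"[^\W\d_]+", text)


sample = "L'espace Schengen en 2024"
whitespace_tokens = whitespace_tokenize(sample)
letter_tokens = letter_tokenize(sample)
```
]]></preformat>
          <p>Here the whitespace strategy keeps "L'espace" and "2024" as tokens, while the letter strategy splits at the apostrophe and drops the number.</p>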
          <p>The file also contains a StemFilterType attribute, which selects the stemming filter to use. Its
possible values are the following:</p>
          <p>• EnglishMinimal: a minimal stemmer for English considering only plural words<sup>7</sup>.</p>
          <p>• Porter: an implementation of the Porter stemming algorithm [11].</p>
          <p>• K: an implementation of the Krovetz stemming algorithm [12], using the native class
KStemFilter in the Lucene library<sup>8</sup>.</p>
          <p>• Lovins: an implementation of the Lovins stemming algorithm [13].</p>
          <p>• SnowBall: an implementation of the Snowball stemmer, a native class in the Lucene library<sup>9</sup>.</p>
          <p>• French: an implementation of the "UniNE" algorithm [14].</p>
          <p>• None: this flag indicates that no stemmer should be used.</p>
          <p>The CLEF LongEval 2024 organizers provided a collection of documents in English and French. For each
language, we chose the parameters that performed best across our trial runs.
The French documents were processed with:</p>
          <p>• Tokenizing: the StandardTokenizer had the best overall performance.</p>
          <p>• Stemming: the StemFilterType used in the system was the French one.</p>
          <p>• Stopword Removal: the StopFilter is used to remove common and frequent words (e.g. "le",
"et", and "à") from the text. In our system, we tried several stoplists of different lengths and
contents; the best we found were stopwords-fr.txt and train24-top125-nominmaxlen.txt. The first list
of stopwords was provided by the open-source project for stopword removal [15]; the second
list was computed by taking the 125 most frequent terms in the collection,
obtained by analyzing the index of the documents built without any stemming or stoplist. All the lists
are available in the MOUSE repository<sup>10</sup>.</p>
          <p>• Elision Removal: the ElisionFilter was used in this system. Elisions in French, such
as the contractions found in "aujourd’hui" or "qu’il", were identified and handled
appropriately to maintain the integrity of words in the analysis.</p>
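          <p>A frequency-based stoplist such as train24-top125-nominmaxlen.txt can be derived along the following lines. This Python sketch is an illustration under stated assumptions: it counts total term occurrences over raw lowercased text rather than reading a Lucene index, and the function name top_k_terms and the toy documents are invented for the example.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import re
from collections import Counter


def top_k_terms(documents, k):
    """Return the k most frequent terms over all documents, most frequent first."""
    counts = Counter()
    for text in documents:
        # No stemming and no stoplist: every lowercased word is counted as-is.
        counts.update(re.findall(r"\w+", text.lower()))
    return [term for term, _ in counts.most_common(k)]


toy_docs = ["le chat et le chien", "le visa et la carte", "le pays"]
toy_stoplist = top_k_terms(toy_docs, 2)
```
]]></preformat>
          <p>With k = 125 and the full collection, the returned list would play the role of the second stoplist described above.</p>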
          <p>For the English documents, instead, we used:</p>
          <p>• Tokenizing: the StandardTokenizer had the best overall performance.</p>
          <p>• Stemming: the StemFilterType used in the system was the Porter one.</p>
          <p>• Stopword Removal: the StopFilter is used to remove common and frequent words (e.g. "the",
"an", and "that") from the text. In this case, we tested different stoplists and chose
stopwords-en.txt and train24-top125-nominmaxlen.txt for the final runs.</p>
          <p>• Possessive Removal: the EnglishPossessiveFilter was used in this system. It removes the
possessives (’s) from words.</p>
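          <p>The English analysis chain can be sketched in a few lines of Python. The sketch covers lowercasing, possessive removal and stopword removal only; stemming is deliberately omitted, the regular-expression tokenizer is a stand-in for the StandardTokenizer, and the function name analyze_english is an assumption of this example rather than part of the MOUSE code.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import re


def analyze_english(text, stopwords):
    """Lowercase, tokenize, strip possessives and drop stopwords (no stemming)."""
    # Treat the typographic apostrophe like the ASCII one.
    normalized = text.lower().replace("\u2019", "'")
    tokens = re.findall(r"[\w']+", normalized)
    cleaned = []
    for token in tokens:
        token = token.strip("'")       # drop stray quotes around a word
        if token.endswith("'s"):
            token = token[:-2]         # possessive removal: "user's" becomes "user"
        if token and token not in stopwords:
            cleaned.append(token)
    return cleaned


analyzed = analyze_english("The user's query and the system's answer",
                           {"the", "and", "an", "that"})
```
]]></preformat>
          <p>A French variant would replace the possessive step with elision handling and use the French stoplist.</p>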
        </sec>
        <sec id="sec-3-23-2">
          <title>3.3. Searcher</title>
          <p>The searching procedure starts when a user submits a query to the system, which then analyzes the
query and searches through indexed documents to find relevant information. The system aims to retrieve
and rank documents that align with the user’s query, returning a list of results that best match the
user’s information needs, ordered by relevance. In this section, we explain the searching techniques
that we implemented in our methodology.</p>
          <p>6 https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LetterTokenizer.html</p>
          <p>7 https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/EnglishMinimalStemmer.html</p>
          <p>8 https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html</p>
          <p>9 https://lucene.apache.org/core/9_10_0/analysis/common/org/tartarus/snowball/SnowballStemmer.html</p>
          <p>10 https://bitbucket.org/upd-dei-stud-prj/seupd2324-mouse/src/master/</p>
          <p>3.3.1. BM25
To evaluate the match between a parsed and analyzed query $Q = (q_1, \dots, q_n)$ and a document $D$,
we used the Okapi BM25 [16, 17] scoring function, reported in Equation 1, where
$f(q_i, D)$ is the frequency with which the query term $q_i$ appears in the document $D$, $|D|$ is the length of $D$, IDF is the Inverse Document
Frequency of the query terms [18], and avgdl is the average document length in the text collection
from which the documents are taken. In implementing our system, we employed the Okapi BM25
retrieval model offered by Lucene<sup>11</sup>. Moreover, we used the default settings provided in the Lucene
library, i.e., $k_1 = 1.2$ and $b = 0.75$, since we found that the BM25 scoring function performed
effectively without tuning these default parameters.</p>
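          <p>To make the scoring function of Equation 1 concrete, the following Python sketch computes a BM25 score over a toy corpus of pre-tokenized documents. It is an illustration, not Lucene's implementation: the smoothed IDF formula used here is one common variant and is an assumption of the example, as are the function name bm25_score and the toy data.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query with BM25."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs   # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)   # document frequency of the term
        # Smoothed IDF (assumed variant); it is always positive.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        f = tf[term]
        denominator = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denominator
    return score


toy_corpus = [
    ["espace", "schengen", "visa"],
    ["emploi", "territorial"],
    ["visa", "travail", "emploi", "france"],
]
scores = [bm25_score(["visa"], doc, toy_corpus) for doc in toy_corpus]
```
]]></preformat>
          <p>With equal term frequency, the shorter first document scores higher than the longer third one, which is the effect of the length normalization controlled by $b$.</p>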
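          <p><disp-formula><tex-math>\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)} \qquad (1)</tex-math></disp-formula></p>
          <p>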
3.3.2. Fuzzy
A fuzzy query is a type of search query that finds matches even when the search terms do not exactly
match the terms in the documents. This approach retrieves (partial) matches between a query and
a document despite variations such as misspelt words, abbreviations and typographical
errors. FuzzyQuery is a Lucene class<sup>12</sup> based on the similarity between terms: the similarity
measure relies on the Damerau-Levenshtein algorithm [19], which computes the
closeness between two strings as the number of insertion, deletion, substitution,
and transposition operations needed to transform one string into the other.</p>
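          <p>The edit distance underlying fuzzy matching can be sketched as follows. The Python function below implements the optimal string alignment variant of the Damerau-Levenshtein distance (adjacent transpositions count as one edit, and no substring is edited twice); it is an illustration of the measure, not the automaton-based implementation used inside Lucene, and the function name is chosen for this example.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


distance_transposition = osa_distance("terriotrial", "territorial")
distance_two_edits = osa_distance("sheingen", "schengen")
```
]]></preformat>
          <p>The two pairs are the misspelled and corrected query terms reported in Section 3.3.3: the first differs by a single transposition, the second by one insertion and one deletion.</p>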
          <p>3.3.3. Spell-Checker
During the first analysis of the query collection, we noticed that some queries contained misspelled words:</p>
          <p>Wrong Queries:
• emploi terriotrial
• espace sheingen</p>
          <p>We used the Lucene SpellChecker class<sup>13</sup> to fix the spelling errors in the queries. It
is used inside the Searcher class and is invoked when the system encounters a token that is not
present in the index. We inspected the dictionary provided to see whether the wrong word could be replaced
with a correct version of it (or with a similar one). Thus, the queries become:</p>
          <p>Corrected Queries:
• emploi territorial
• espace schengen</p>
          <p>3.3.4. Word N-Grams
The Searcher (Section 3.3) also implements the ShingleFilter, a filter that constructs shingles, i.e., word n-grams,
from a given stream of terms. Shingling divides the words
of a sentence into sequences of n consecutive words, which improves search relevance.</p>
          <p>For example, from a sentence like "Let’s try it" we can obtain "let’s try, try it, let’s it", so
the sentence is divided into three shingles. We handled the base case in which the query has only
one word, where the shingle is the query itself. Specifically, we set two variables: maxShingleSize,
which represents the maximum number of shingles a query should be divided into, and shingleProximity, which
measures the distance between the terms of a shingle when identifying documents in which the search terms
appear. We tested the system with and without this filter and saw a remarkable increase in MAP. The Word
N-grams method is applied to queries that contain more than a single word, creating overlapping
sequences of terms.</p>
          <p>11 https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/similarities/BM25Similarity.html</p>
          <p>12 https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/FuzzyQuery.html</p>
          <p>13 https://lucene.apache.org/core/9_10_0/suggest/org/apache/lucene/search/spell/SpellChecker.html</p>
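          <p>Word n-grams over a query can be generated as in the following Python sketch. It produces contiguous shingles only, so a non-adjacent pair such as "let’s it" is not generated, and it interprets max_shingle_size as the maximum number of words per shingle; both choices are assumptions of this illustration and may differ from the MOUSE Searcher, as is the function name word_shingles.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def word_shingles(text, max_shingle_size=2):
    """Return the contiguous word n-grams of sizes 2..max_shingle_size."""
    tokens = text.lower().split()
    if len(tokens) <= 1:
        return tokens                    # base case: a one-word query is its own shingle
    shingles = []
    for size in range(2, max_shingle_size + 1):
        for start in range(len(tokens) - size + 1):
            shingles.append(" ".join(tokens[start:start + size]))
    return shingles


bigrams = word_shingles("espace schengen visa")
up_to_trigrams = word_shingles("espace schengen visa", max_shingle_size=3)
```
]]></preformat>
          <p>Each shingle could then be turned into a phrase query, with a proximity parameter controlling how far apart its terms may appear in a matching document.</p>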
        </sec>
        <sec id="sec-3-23-3">
          <title>3.4. Query Expansion</title>
          <p>
            To improve the effectiveness of the queries, we adopted a query boosting methodology based on LLMs,
as previously done by [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. Figure 3 represents the pipeline used.
          </p>
          <p>[Figure 3: query expansion pipeline based on the groq API.]</p>
        </sec>
      </sec>
      <sec id="sec-3-24">
        <title>Query</title>
        <p>We wrote the Python script queryExpansion.py to interact with the groq API<sup>14</sup> and the models
offered by the organization in its cloud infrastructure, i.e., LLaMA [20] models (LLaMA3 7b and
LLaMA2 70b), Mixtral 8x7b [21], and Gemma 7b [22], deployed using the versions provided by the
Hugging Face platform<sup>15</sup>. To access the models, we generated a GROQ_API_KEY on the groq
cloud platform and saved it in a configuration file read by the script. Since we used the Free
Beta program of the service, we worked under some limitations: the number of requests per
minute is limited to 30, and we could not fine-tune any model. Nevertheless, to exploit the capabilities
of the LLMs used, we tried different contextualizations in the submitted prompt and found the following
to be the most effective, written in French (the original English prompt was translated with the help of the
Google Translation service).</p>
        <p>Prompt:
"Extend the following query with 20 related terms or expressions in French: QUERY. Only print
terms on a single line, separated by commas, do not add additional text or explanations."
French translation:
"Étendre la requête suivante avec 20 termes ou expressions liés en French: QUERY. Imprimez
uniquement les termes sur une seule ligne, séparés par des virgules, n’ajoutez pas de texte ou
d’explications supplémentaires."</p>
        <p>Hence, the script queryExpansion.py takes as input the path to the query dataset, passed
with the "-d" flag, and the name of the model to be used, passed with the "-m" flag, and stores the final expanded queries in a .tsv file
to be used in the retrieval phase. The communication between the local machine and the model in the
cloud respects the API limit by pausing for 60 seconds after every 30 submitted queries, in order to avoid the "429
TooManyRequests" HTTP error. The results are saved in a predefined directory, storing the query
id, original query, expanded query, and model used for the expansion.</p>
        <p>14 https://github.com/groq/groq-python</p>
        <p>15 https://huggingface.co/</p>
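        <p>The batching logic described above (30 requests, then a 60-second pause) can be sketched without any network access. In the Python sketch below, the expand argument is a placeholder for whatever function sends the prompt to the LLM service; it is not the groq client API. The function name expand_queries, the TSV header names and the injectable sleep parameter are assumptions made for this illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import csv
import io
import time

PROMPT = ("Extend the following query with 20 related terms or expressions in French: "
          "{query}. Only print terms on a single line, separated by commas, "
          "do not add additional text or explanations.")


def expand_queries(queries, expand, model, out,
                   batch_size=30, pause_seconds=60.0, sleep=time.sleep):
    """Expand (query_id, text) pairs and write one TSV row per query.

    `expand` is a hypothetical callable taking a prompt and returning the
    model's answer; `sleep` is injectable so the pause can be tested.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["query_id", "original_query", "expanded_query", "model"])
    for count, (query_id, text) in enumerate(queries, start=1):
        expanded = expand(PROMPT.format(query=text))
        writer.writerow([query_id, text, expanded, model])
        if count % batch_size == 0:
            sleep(pause_seconds)         # stay under the requests-per-minute limit


# Usage with a stub in place of the real LLM call and a recorded, non-blocking sleep.
pauses = []
buffer = io.StringIO()
toy_queries = [(str(i), "query %d" % i) for i in range(61)]
expand_queries(toy_queries, lambda prompt: "terme1, terme2", "demo-model",
               buffer, sleep=pauses.append)
```
]]></preformat>
        <p>Passing the sleep function as a parameter keeps the rate-limiting policy visible and lets it be exercised without waiting; a real run would leave the default time.sleep in place.</p>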
        <sec id="sec-3-24-1">
          <title>3.5. Reranking</title>
          <p>For this task, we tried three different approaches. The first connects to the
Cohere API<sup>16</sup> to use LLMs for computing the reranking scores. The second uses the
Pygaggle library<sup>17</sup>, a Python library for text ranking with deep neural architectures.
The third uses the reranking model ms-marco-MultiBERT-L-12, with the multilingual support provided
by the Python module FlashRank<sup>18</sup>. The first two approaches gave us the best
performance, so we discarded the third one to avoid burdening the system.</p>
          <p>Furthermore, the Cohere approach, thanks to the API service, gives us the opportunity to rerank
up to 100 documents; on the other hand, Pygaggle runs the model on the local machine and, with our
computational power, we were able to rerank approximately 20 documents per run. We wrote a simple
Python script for each of these approaches with two functions: load_ranker and rerank. The former is
called inside the constructor of the Searcher class, loading the model and saving it into a global variable.
After that, the method rerank is called in the searching process after retrieving documents. This method
receives the query title and the contents of documents as parameters to perform the reranking.</p>
          <p>
            Once the scores have been computed, we used the normalization function proposed by Bolzonello
et al. [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], reported in Equation 2, to create the final ranking.
          </p>
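          <p><disp-formula><tex-math>\mathrm{nScore}(i) = \left(\mathrm{Score}(i) + \min_{j \in [1, N]} \mathrm{Score}(j)\right) \cdot \frac{\mathrm{Score}_{\mathrm{BM25}}(1)}{\mathrm{Score}(1)} \qquad (2)</tex-math></disp-formula></p>
          <p>
            where $\mathrm{Score}_{\mathrm{BM25}}(i)$ is the score given by BM25 for the document at rank position $i$ and $\mathrm{Score}(i)$ is
the score computed by the reranker for the document at rank position $i$, while $N$ is the total number of
documents reranked. The ratio $\mathrm{Score}_{\mathrm{BM25}}(1) / \mathrm{Score}(1)$ preserves the score computed for the first retrieved document
by the BM25 IRS. Finally, as done by Bolzonello et al. [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], the final score is computed as in Equation 3.
          </p>
          <p><disp-formula><tex-math>\mathrm{finalScore}(i) = \mathrm{mntr} + (1 - w) \cdot \mathrm{Score}_{\mathrm{BM25}}(i) + w \cdot \mathrm{nScore}(i) \qquad (3)</tex-math></disp-formula></p>
          <p>where mntr is the maximum score of the documents that are not reranked, which preserves the order of the leftover
documents, and $w$ is the weight given to the reranker. For the experiments, we used</p>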
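          <p>The score fusion of Equations 2 and 3 can be sketched in Python as follows. The sketch assumes that both score lists are ordered by the BM25 rank position, that the ratio in Equation 2 is the BM25 score of the first document divided by its reranker score, and that the reranker weight is called w; the function names and the numeric values are illustrative only and are not taken from the submitted runs.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def normalize_scores(rerank_scores, bm25_top_score):
    """Equation 2 (as read here): shift by the minimum, then rescale to the BM25 range."""
    if rerank_scores[0] == 0:
        raise ValueError("the reranker score at rank position 1 must be non-zero")
    shift = min(rerank_scores)
    scale = bm25_top_score / rerank_scores[0]
    return [(score + shift) * scale for score in rerank_scores]


def final_scores(bm25_scores, rerank_scores, mntr, w):
    """Equation 3: weighted mix of BM25 and normalized reranker scores, offset by mntr.

    mntr is the maximum score among the documents that are not reranked, so every
    reranked document stays above the leftover ones when the mixed term is non-negative.
    """
    n_scores = normalize_scores(rerank_scores, bm25_scores[0])
    return [mntr + (1 - w) * bm25 + w * n_score
            for bm25, n_score in zip(bm25_scores, n_scores)]


# Two reranked documents; the reranker prefers the second one.
fused = final_scores(bm25_scores=[10.0, 8.0], rerank_scores=[2.0, 4.0], mntr=5.0, w=0.6)
```
]]></preformat>
          <p>In this toy case the second document overtakes the first after fusion, while both remain above the offset mntr.</p>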
          <p>The data presented in Table 1 outlines the models utilized for query expansion and reranking, as well
as other information on the runs submitted to CLEF24 LongEval.</p>
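          <table-wrap id="tab1">
            <label>Table 1</label>
            <caption>
              <p>Summary of parameters for the runs submitted to CLEF 24 LongEval.</p>
            </caption>
            <table>
              <thead>
                <tr><th>Run</th><th>Collection</th><th>StemFilter</th><th>Tokenizer</th><th>Stoplist</th><th>Query Expansion Model</th><th>Reranking Model</th></tr>
              </thead>
              <tbody>
                <tr><td>Run 1</td><td>English</td><td>Porter</td><td>Standard</td><td>train24-top125-nominmaxlen.txt</td><td>Llama3-70b</td><td>Pygaggle-Luyu-20-w06</td></tr>
                <tr><td>Run 2</td><td>English</td><td>Porter</td><td>Standard</td><td>train24-top125-nominmaxlen.txt</td><td>Llama3-70b</td><td>Cohere-100-w06</td></tr>
                <tr><td>Run 3</td><td>English</td><td>Porter</td><td>Standard</td><td>train24-top125-nominmaxlen.txt</td><td>Mixtral-8x7b</td><td>Pygaggle-Luyu-20-w06</td></tr>
                <tr><td>Run 4</td><td>English</td><td>Porter</td><td>Standard</td><td>stopwords-en.txt</td><td>Llama3-70b</td><td>-</td></tr>
                <tr><td>Run 5</td><td>English</td><td>Porter</td><td>Standard</td><td>-</td><td>Mixtral-8x7b</td><td>-</td></tr>
                <tr><td>Run 6</td><td>French</td><td>French-Light</td><td>Standard</td><td>train24-top125-nominmaxlen.txt</td><td>Llama3-70b</td><td>Pygaggle-Luyu-20-w06</td></tr>
                <tr><td>Run 7</td><td>French</td><td>French-Light</td><td>Standard</td><td>train24-top125-nominmaxlen.txt</td><td>Llama3-70b</td><td>Cohere-100-w06</td></tr>
                <tr><td>Run 8</td><td>French</td><td>French-Light</td><td>Standard</td><td>train24-top125-nominmaxlen.txt</td><td>Mixtral-8x7b</td><td>Pygaggle-Luyu-20-w06</td></tr>
                <tr><td>Run 9</td><td>French</td><td>French-Light</td><td>Standard</td><td>stopwords-fr.txt</td><td>Llama3-70b</td><td>-</td></tr>
                <tr><td>Run 10</td><td>French</td><td>French-Light</td><td>Standard</td><td>-</td><td>Mixtral-8x7b</td><td>-</td></tr>
              </tbody>
            </table>
          </table-wrap>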
        </sec>
      </sec>
      <sec id="sec-3-25">
        <title>Figure</title>
        <p>[Figure: reranking step. Components: Query, BM25, Retrieved Documents, Pygaggle, Reranked Documents.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>The experimental setup of the project consists of the following:</p>
      <p>• The project source code is available in the MOUSE group repository on Bitbucket:
https://bitbucket.org/upd-dei-stud-prj/seupd2324-mouse/src/master/.</p>
      <p>• The collection used was provided by the CLEF LongEval 2024 organizers at:
https://researchdata.tuwien.at/records/y60e9-k9b51.</p>
      <p>• The evaluation tool used is the trec_eval script (provided during the course and present in the
repository).</p>
      <p>• To compute the runs, we used the following hardware:
– PC 1 - French: Microsoft Windows 11 Home, CPU 13th Gen. Intel i9 (20@5.2GHz), GPU Intel Raptor Lake-P [Iris Xe Graphics].
– PC 2 - English: Pop!_OS 22.04 LTS, CPU 11th Gen. Intel i7 (8@4.7GHz), GPU Intel TigerLake-LP GT2 [Iris Xe Graphics].</p>
      <p>We provided a README file in the group repository with all the instructions for reproducibility.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>In this Section, we present the experimental results obtained and discuss the findings. We structured
the presentation of the results by dividing the discussion based on the language of the collection used,
i.e., English and French.</p>
      <sec id="sec-5-1">
        <title>5.1. Training Results</title>
        <p>We report the results obtained using the implemented system. For clarity, we
name the models used for the runs after the parameter values
reported in Table 1, following a common structure:</p>
        <p>Model: StemFilter_Tokenizer_Stoplist_QueryExpansionModel_RerankingModel</p>
        <p>English Collection Results. Table 2 displays the MAP and nDCG scores for the runs conducted
on the English data collection. As outlined in Section 3, the English dataset was generated through
automated translation of the original French queries. Consequently, we noted that MAP and
nDCG were affected by the quality of these translations. Furthermore, it appears that query
expansion and reranking techniques based on LLMs are not sufficient to resolve the issue of
poorly translated queries and documents.</p>
        <p>[Table 2: MAP and nDCG of the runs on the English collection.]</p>
        <p>Figure 5 illustrates the interpolated Precision-Recall curve, which helps in
understanding the inverse relationship between the two measures. The graph highlights the importance of
balancing Precision, i.e., the fraction of retrieved documents that are relevant to the user,
and Recall, i.e., the effectiveness of the retrieval system in obtaining all pertinent documents. The
curve indicates that the model employing the LLama3-70b and Cohere models for query expansion and
reranking, respectively, generally provides a better Precision-Recall trade-off than the other
models. Conversely, the absence of reranking and stopword removal appears to be disadvantageous for the
system, cf. models Porter_Standard_stopwords-en.txt_LLama3-70b and Porter_Standard_Mixtral-8x7b.</p>
        <p>[Figure 5: interpolated Precision-Recall curves for the runs on the English collection. x-axis: Recall; y-axis: Interpolated Precision. Series:
Porter_Standard_train24-top125-nominmaxlen.txt_LLama3-70b_Pygaggle-Luyu-20-w06,
Porter_Standard_train24-top125-nominmaxlen.txt_LLama3-70b_Cohere-100-w06,
Porter_Standard_train24-top125-nominmaxlen.txt_Mixtral-8x7b_Pygaggle-Luyu-20-w06,
Porter_Standard_stopwords-en.txt_LLama3-70b,
Porter_Standard_Mixtral-8x7b.]</p>
        <p>French Collection Results. Using the original French queries positively impacted the system’s
performance in terms of both MAP and nDCG. Table 3 describes the systems’ performance for the
different runs using the French collection. The French runs yielded a considerable improvement for all
the models, with each of them increasing MAP and nDCG by at least 7% and 9%, respectively.
Once again, the worst results were obtained by the model with no stopword filtering and no reranking,
which falls below the best model by 5% in terms of MAP and 4% in terms of nDCG. However,
this model still performs better than the best model of Table 2, which used the
English collection data.</p>
        <p>[Table 3: MAP and nDCG of the runs on the French collection. Models:
French-Light_Standard_train24-top125-nominmaxlen.txt_LLama3-70b_Pygaggle-Luyu-20-w06,
French-Light_Standard_train24-top125-nominmaxlen.txt_LLama3-70b_Cohere-100-w06,
French-Light_Standard_train24-top125-nominmaxlen.txt_Mixtral-8x7b_Pygaggle-Luyu-20-w06,
French-Light_Standard_stopwords-fr.txt_LLama3-70b,
French-Light_Standard_Mixtral-8x7b.]</p>
        <p>Finally, we also present the interpolated Precision-Recall curve derived from the French dataset. Once
again, we note a consistent pattern in the Precision-Recall trade-off, highlighting the importance of
stopword removal for this collection before starting the search process. We observed that the
integration of query expansion and reranking based on LLMs results in consistently strong
performance across the different models used.</p>
        <p>[Figure: interpolated Precision-Recall curves for the runs on the French collection. x-axis: Recall; y-axis: Interpolated Precision. Series:
French-Light_Standard_train24-top125-nominmaxlen.txt_LLama3-70b_Pygaggle-Luyu-20-w06,
French-Light_Standard_train24-top125-nominmaxlen.txt_LLama3-70b_Cohere-100-w06,
French-Light_Standard_train24-top125-nominmaxlen.txt_Mixtral-8x7b_Pygaggle-Luyu-20-w06,
French-Light_Standard_stopwords-fr.txt_LLama3-70b,
French-Light_Standard_Mixtral-8x7b.]</p>
        <p>5.1.1. Discussion
Observing the results obtained, we found that the English collection, composed of automatically
translated queries and documents, yielded poor performance. This lack of effectiveness can
be attributed to the automatic translation process, which often fails to translate the query correctly
and alters the user’s real information needs. Second, the adoption of a stopword list had an
important impact on the systems’ performance: runs in which irrelevant words were not removed
did not yield meaningful results, contributing to poor performance even when queries were expanded.
Moreover, the training results show that the influence of LLMs on query expansion and reranking
is significant in both analyzed scenarios. Across the experiments, the most effective models consistently
included query expansion and reranking pipelines based on LLama3, Mixtral, Cohere, and Pygaggle
models. These models increased the overall performance in terms of MAP, nDCG, and Precision-Recall
trade-off, showcasing the significant potential of leveraging advanced language models for enhancing
retrieval relevance.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Test Results</title>
        <p>This Section presents the statistical tests performed to investigate the performance of the submitted
runs on the short-term and long-term collections. Specifically, we examine the performance changes and the
statistical analysis of the results obtained. We use the two-way ANOVA (ANOVA 2) test with a significance
level of α = 0.05 to assess whether the null hypothesis can be rejected, which would indicate a statistically
significant difference between the results of the given runs. The two-way ANOVA determines
how two independent variables, namely the system and the query, affect a dependent
variable, which in our case is a specific measure. In each ANOVA 2 table, we report df, i.e., the degrees
of freedom of the source; SS, i.e., the sum of squares due to the source; MS, i.e., the mean sum of squares
due to the source; F, i.e., the F-statistic; and PR(&gt;F), i.e., the p-value. For brevity, we
report only the most meaningful insights on the measures collected; all the analyses and graphs
are available in the study repository. Finally, we performed pairwise comparisons with Tukey’s
Honestly Significant Difference (HSD) test to verify whether significant
statistical differences are present among our submitted systems. In each multiple-comparison graph, the best
model is highlighted in blue, the models that differ significantly from it according to Tukey’s HSD
test are in red, and the models that are not statistically different are in gray.</p>
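        <p>To make the quantities in the ANOVA 2 tables concrete, the following sketch computes df, SS, MS, and F for a two-way layout with one score per topic-system pair (no replication and no interaction term, which is an assumption about the design). The function name and the toy scores are hypothetical. The p-value PR(&gt;F) is left out to keep the sketch dependency-free, since it additionally requires the survival function of the F distribution.</p>

```python
from typing import Dict, List


def two_way_anova(scores: List[List[float]]) -> Dict[str, Dict[str, float]]:
    """Two-way ANOVA without replication.

    scores[t][s] is the measure (e.g. nDCG@10) of system s on topic t.
    """
    n_topics = len(scores)
    n_systems = len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n_topics * n_systems)
    topic_means = [sum(row) / n_systems for row in scores]
    system_means = [sum(row[s] for row in scores) / n_topics for s in range(n_systems)]

    # Sums of squares due to each source; the residual is what remains.
    ss_topics = n_systems * sum((m - grand_mean) ** 2 for m in topic_means)
    ss_systems = n_topics * sum((m - grand_mean) ** 2 for m in system_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_resid = ss_total - ss_topics - ss_systems

    df_topics = n_topics - 1
    df_systems = n_systems - 1
    df_resid = df_topics * df_systems
    ms_resid = ss_resid / df_resid

    table = {}
    for name, ss, df in (("Topics", ss_topics, df_topics),
                         ("Systems", ss_systems, df_systems)):
        ms = ss / df
        # F is the ratio of the source mean square to the residual mean square.
        table[name] = {"df": df, "SS": ss, "MS": ms, "F": ms / ms_resid}
    table["Residuals"] = {"df": df_resid, "SS": ss_resid, "MS": ms_resid}
    return table


# Hypothetical nDCG@10 values: 4 topics (rows) by 3 systems (columns).
toy_scores = [
    [0.30, 0.28, 0.20],
    [0.55, 0.50, 0.41],
    [0.10, 0.12, 0.05],
    [0.42, 0.40, 0.33],
]
anova = two_way_anova(toy_scores)
for source, row in anova.items():
    print(source, {k: round(v, 4) for k, v in row.items()})
```

        <p>In this layout, each F value is the MS of its source divided by the residual MS, which is how the F and MS columns of an ANOVA table relate to each other.</p>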
        <p>Since our submissions used both the English and the French collections, for this final analysis we
decided to keep all the systems used during the training phase.
5.2.1. English Test Results: Short Term collection
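[Table 4, reconstructed from flattened text; values are paired with systems by their order, and the header of one column did not survive extraction.]
System: MAP (June), nDCG, unlabeled column
Eng-Llama3-Cohere rerank: 0.1705, 0.3060, 0.2031
Eng-Llama3-Pygaggle rerank: 0.1666, 0.3038, 0.1982
Eng-Mixtral-Pygaggle rerank: 0.1664, 0.3036, 0.1992
Eng-Llama3-NoRerank: 0.1525, 0.2914, 0.1805
Eng-Mixtral-NoRerank: 0.1527, 0.2910, 0.1807</p>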
        <p>Table 4 reports the test results obtained by each of the submitted systems on the June test collection.
Compared with the results obtained during the training phase, cf. Table 2, there is a
slight reduction in the effectiveness of the retrieval systems, particularly in terms of MAP and nDCG.
The box plots in Figure 7 show that the model using Llama3 query expansion and Cohere
reranking is the most robust in handling the performance loss from the
training to the testing phase, preserving good results for some topics, as indicated by both the nDCG@10 and
P@10 metrics. A final important remark on the box plots is that all the systems present a similar trend
in terms of median and interquartile range, and all show outliers for specific queries.
ANOVA 2. The ANOVA 2 analysis was performed to investigate how well the different systems were
able to retrieve the relevant documents for the provided topics and, in addition, to understand how
the different topics, i.e., the queries, performed across these systems. Table 5 shows the ANOVA 2
tables for nDCG@10 and P@10 on the June dataset. The key insight from Table 5 is that the p-value of the
test is below the threshold α = 0.05, implying that there is indeed a statistically significant difference
among the systems.
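[Table 5, partially recovered; the row labels, the df and SS columns, and the measure label did not survive extraction. MS: 0.1162, 0.0303, 0.0022. F: 53.1366, 13.8747. PR(&gt;F): 0, 3.9426e-11.]</p>
        <p>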
Tukey’s HSD Test. The results of Tukey’s HSD multiple comparisons are reported in Figure 8. The
pairwise tests show that, on the short-term dataset, there are no statistically significant differences among the best
systems, i.e., the models that use LLMs for both query expansion and reranking,
for either nDCG@10 or P@10. Significant differences do appear, on the other hand,
with respect to the models that use only query expansion.
5.2.2. English Test Results: Long Term collection
Table 6 reports the test results obtained by each of the submitted systems on the English August
test collection. As before, the results are sorted by nDCG;
the model that achieved the best nDCG was the Llama3 Cohere model.</p>
        <p>From the box plots in Figure 9, we can once again observe that, for both measures, the
graphs are comparable in terms of median and inner quartiles, with outlier topics
towards better performance. Comparing Figure 9 with the June test results in
Figure 7, we see that the model with the highest mean in terms of nDCG@10
and P@10 is the Llama3 Cohere model, demonstrating the robustness of the system in handling changes
over time.
[Residue of Figure 9 and Table 6: apart from axis ticks, only the value 0.1589 survives extraction.]</p>
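        <p>For reference, the two per-topic measures compared in these box plots can be sketched as follows. This is a minimal illustration assuming graded relevance judgments, linear gains, and a log2(rank + 1) discount for nDCG@10; other conventions exist, and the exact variant used by the evaluation tooling is not stated here. All names and the toy data are hypothetical.</p>

```python
import math
from typing import Dict, Sequence


def precision_at_k(ranking: Sequence[str], qrels: Dict[str, int], k: int = 10) -> float:
    """Fraction of the top-k positions holding a document with a positive grade."""
    return sum(1 for d in ranking[:k] if qrels.get(d, 0) > 0) / k


def dcg(gains: Sequence[int]) -> float:
    """Discounted cumulated gain with linear gain and log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranking: Sequence[str], qrels: Dict[str, int], k: int = 10) -> float:
    """DCG of the top-k results divided by the DCG of an ideal top-k ranking."""
    gains = [qrels.get(d, 0) for d in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments (grade per document) and a hypothetical ranking.
toy_qrels = {"d1": 2, "d2": 1, "d3": 0, "d4": 1}
toy_ranking = ["d2", "d3", "d1", "d5"]
p10 = precision_at_k(toy_ranking, toy_qrels)
ndcg10 = ndcg_at_k(toy_ranking, toy_qrels)
print(round(p10, 4), round(ndcg10, 4))
```

        <p>Computing these values for every topic and system gives the per-topic scores that box plots and ANOVA tables of this kind summarise.</p>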
        <p>ANOVA 2. In this case too, taking the systems and the topics used
for the retrieval process as the two independent variables, we investigated how these variables influenced the performance
in terms of nDCG@10 and P@10. The ANOVA 2 table, Table 7, provides evidence of a statistically
significant difference between the models for all analyzed measures, since the p-value is lower
than the threshold α = 0.05.
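[Table 7, partially recovered; the df and SS columns and the measure label did not survive extraction.]
Source: MS, F, PR(&gt;F)
Topics: 0.0756, 56.3080, 0
Systems: 0.0583, 43.4200, 5.5860e-36
Residuals: 0.0013, -, -
Total: values not recovered</p>
        <p>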
Tukey’s HSD Test. The graphs in Figure 10 show the pairwise Tukey’s HSD test. For the
nDCG@10 measure, there is no statistically significant difference among the systems with LLM-based
reranking, whereas the models that do not implement any reranking strategy perform
worse and are statistically different from the best model. For P@10, however,
the best model also differs significantly from the other two systems based on LLM query expansion
and reranking. This underlines the positive impact that
query expansion and reranking performed with recent LLM architectures have on the
search pipeline.
5.2.3. French Test Results: Short Term collection</p>
        <p>[Table 8, values not recovered. Systems in the reported order: Fr-Llama3-Cohere rerank, Fr-Mixtral-Pygaggle rerank, Fr-Llama3-Pygaggle rerank, Fr-Llama3-NoRerank, Fr-Mixtral-NoRerank.]</p>
        <p>Table 8 reports the test results obtained on the June French test collection by the five submitted
systems. Compared with the results achieved in the training phase, cf. Table 3, the effectiveness
of the retrieval systems has slightly decreased. Nevertheless, the MAP and nDCG values
remain considerable, especially compared with those obtained on the English collection. A possible
explanation for the French results being superior to the English ones
is that the French collection contains the original documents and queries,
whereas the English collection was produced by automatic translation. As a consequence,
the English results depend heavily on the quality of the translation.</p>
        <p>Examining the box plots in Figure 11, we can see that the systems with
reranking outperformed those without it. In particular, the system that employed Llama3 for
query expansion and Cohere for reranking achieved, once again, the most favourable
results.
ANOVA 2. As previously done for the English June collection, we performed the ANOVA 2 test to gain
a deeper understanding of how topics and systems influence the computed values. In this
case, Table 9, the p-value is below the threshold α = 0.05, implying a statistically significant difference
in the analysis of systems and queries.
Tukey’s HSD Test. Tukey’s HSD test confirmed the importance of the LLM-based
query expansion and reranking process: in Figure 12, the best model, the French Llama3
Cohere model, is statistically different from the Mixtral
NoRerank model in terms of both nDCG@10 and P@10. Moreover, query expansion performed with the Llama3 model alone
proves strong enough to approach the performance of the reranking systems. A possible future
direction is a more extensive analysis of which LLM architecture to adopt
for query expansion and for reranking. We leave this
aspect as an open issue for future work.
5.2.4. French Test Results: Long Term collection</p>
        <p>[Table 10, values not recovered. Systems in the reported order: Fr-Llama3-Pygaggle rerank, Fr-Llama3-Cohere rerank, Fr-Mixtral-Pygaggle rerank, Fr-Llama3-NoRerank, Fr-Mixtral-NoRerank.]</p>
        <p>Finally, we report the August results for the French collection. Table 10 reports the test results
obtained by each of the submitted systems on the test collection. Compared with the results
achieved in the training phase, shown in Table 3, the effectiveness of
the retrieval systems has once again decreased slightly. Nevertheless, notable MAP and
nDCG values have been attained, this time by the Llama3 Pygaggle model, particularly in comparison with those
obtained on the English collection. As noted for the June collection, a likely reason
for the better French results is that the French collection uses
the original documents and queries, whereas the English collection was automatically translated,
so the quality of the translation significantly influences the English
results.</p>
        <p>Moreover, the box plots in Figure 13 show results comparable to those in the Sections
above. This time, however, the mean P@10 values of the models
with LLM-based query expansion and reranking are below the medians, indicating a negatively
skewed distribution of these results.</p>
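        <p>The quantities read off such box plots (median, interquartile range, outliers, and the position of the mean relative to the median) can be reproduced with a short sketch like the one below. It assumes the common convention that points beyond 1.5 times the interquartile range from the quartiles are drawn as outliers; the plotting settings actually used for the figures are not stated, and the function name and toy scores are hypothetical.</p>

```python
import statistics
from typing import Dict, List


def boxplot_summary(values: List[float]) -> Dict[str, object]:
    """Summarise per-topic scores with the quantities a box plot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    mean = statistics.mean(values)
    return {
        "mean": mean,
        "median": median,
        "iqr": iqr,
        # Scores beyond the 1.5 * IQR fences are the points drawn as outliers.
        "outliers": [v for v in values if v > high_fence or low_fence > v],
        # A mean below the median suggests a longer tail towards low scores.
        "mean_below_median": median > mean,
    }


# Hypothetical P@10 scores of one system over nine topics.
toy_p_at_10 = [0.0, 0.5, 0.6, 0.6, 0.7, 0.7, 0.7, 0.8, 0.8]
summary = boxplot_summary(toy_p_at_10)
print(summary)
```

        <p>In this toy case a single very low score pulls the mean below the median, which is the pattern described above as a negatively skewed distribution.</p>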
        <p>ANOVA 2. Table 11 shows the results of the ANOVA 2 for nDCG@10 and P@10. In this case too, there are
statistically significant differences, as the p-value obtained from the two-way test is below the threshold α = 0.05.</p>
        <p>Tukey’s HSD Test. To conclude the statistical analysis, we performed a pairwise Tukey’s
HSD test to identify statistically significant differences between the models. In contrast to the
June French collection, cf. Figure 12, the models without reranking differ significantly
from the reranking models in terms of nDCG@10, while the test on the P@10 measure
underlines once again the robustness of the Llama3 query expansion process, showing no differences
between reranking with Cohere and with Pygaggle.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>Our work proposed a method for searching relevant documents based on query expansion and reranking
techniques that rely on LLMs. Our system achieved a MAP of 0.3127 and an nDCG
of 0.5077 on the French data collection using the LLama3-70b and Pygaggle-Luyu-20-w06 models for
query expansion and reranking, respectively.</p>
      <p>We observed a significant improvement when employing the French dataset rather than its
English counterpart. It appears that translation may introduce inaccuracies and inconsistencies
that lower the overall data quality. Moreover, given the results obtained, query expansion and reranking based on
LLMs show promising potential for further investigation.
In future work, we plan to enhance the effectiveness of our system by exploring
next-generation LLMs for translation, query expansion, and reranking, as they may offer better
task comprehension and text generation than current models. Finally,
experimenting with various combinations of stoplists and stemming methods could yield even more
favourable outcomes.
      </p>
    </sec>
  </body>
  <back>
  </back>
</article>