SEUPD@CLEF: Team DAM on Reranking Using Sentence Embedders
Notebook for the LongEval Lab at CLEF 2024

Alberto Basaglia, Andrea Stocco, Milica Popović and Nicola Ferro
University of Padua, Italy

CLEF 2024: Conference and Labs of the Evaluation Forum, September 9–12, 2024, Grenoble, France
alberto.basaglia@studenti.unipd.it (A. Basaglia); andrea.stocco.8@studenti.unipd.it (A. Stocco); milica.popovic@studenti.unipd.it (M. Popović); nicola.ferro@unipd.it (N. Ferro)
https://www.dei.unipd.it/~ferro/ (N. Ferro); ORCID 0000-0001-9219-6239 (N. Ferro)

Abstract
This report gives an overview of the system developed by Team DAM for Task 1 of the LongEval Lab at CLEF 2024. The team members are students enrolled in the Computer Engineering master's program at the University of Padua. The team developed an information retrieval system which is used to run queries on a corpus of documents in French, or on their translated English version.

Keywords
CLEF, LongEval 2024, Information Retrieval, Search Engine, Documents Retrieval, Temporal Persistence, Reranking, Word Embeddings

1. Introduction

Nowadays, online searching for all types of information has become part of people's daily routines. Billions of users worldwide expect to find the information they need quickly and accurately. A search engine (SE) is software that helps people satisfy such needs, using queries to express an information need. Since the number of web pages has been increasing rapidly, this type of software faces considerable challenges. One of the main challenges is the variability of the performance of the system over time. That is why the LongEval Lab, organized by the Conference and Labs of the Evaluation Forum (CLEF), addresses this problem by encouraging participants to develop information retrieval (IR) systems that can adapt to the evolution of the corpus over time.

The paper is organized as follows: Section 2 reviews related work; Section 3 describes our approach; Section 4 explains our experimental setup; Section 5 discusses our main findings; finally, Section 6 draws some conclusions and outlines future work.

2. Related Work

Sentence embedders have been extensively used for the reranking phase of information retrieval systems for many years [1, 2]. Recent research has continued to demonstrate the effectiveness of reranking approaches. For instance, Bolzonello et al. (2023) successfully utilized a reranking-based approach, further validating its efficacy in enhancing retrieval performance [3].

3. Methodology

Our approach is based on the usage of carefully tuned off-the-shelf components provided by Apache Lucene. In addition to those, a reranking phase based on a sentence embedder has been implemented. Our idea was to use a model fine-tuned mostly on online data (Reddit comments, citation pairs and WikiAnswer, just to name a few) to try to encode the meaning of topics and documents in an effective way. Furthermore, we tried to improve one of the works from the previous year, which used reranking to improve IR system performance (Bolzonello et al. [3]). We used a similar approach based on reranking a small chunk of documents, but with different sentence embedders and a different analyzer pipeline. In this section we will cover the methodology that has been used to develop our IR system.

Figure 1: SE architecture
In order to better understand the complete IR system we developed, all of the system's components will be explained using the diagram in Figure 1. The diagram shows the main components of a SE as well as the distinction between offline and online components.

3.1. Apache Lucene

To develop our IR system, we used Apache Lucene version 9.10.0, downloaded from https://lucene.apache.org/core/downloads.html. The Apache Lucene project develops open-source search software. It is a high-performance, full-featured SE library written entirely in Java. This library provides a robust and scalable set of tools to developers building efficient IR systems [4]. Thanks to Apache Lucene, we have been able to handle vast amounts of documents with ease, using powerful, accurate, and efficient search algorithms. Moreover, its active community and frequent updates ensure that developers have access to the latest advancements and optimizations in the field of IR.

3.2. Parsing

First of all, to create a fast and reliable IR system, it is necessary to parse the data we want to run our queries on. For this task, LongEval releases the corpus of documents in two different formats, TREC and JSON. We decided to use the JSON files. The corpus is divided into several files, each of them containing a JSON array of documents. The structure of a document is shown in Figure 2.

Figure 2: Structure of a document
{
  "docno": "...",
  "text": "..."
}

It was thus necessary to write a parser able to read the documents efficiently from disk. In order to create an efficient parser for reading documents from the LongEval corpus in JSON format, several key components were implemented:

• File Parser: a file parser is responsible for efficiently reading the JSON files containing arrays of documents. This component iterates through each line of the file and extracts the JSON objects, each of which represents a single document;
• Document Model: a document model defines the structure of a document in the corpus. In this case, each document consists of two fields: docno (document number) and text (document text), as shown in Figure 2. This model is used to deserialize JSON objects into Java objects during parsing;
• JSON Deserialization: the parser uses a JSON deserialization library, Jackson [5], to convert JSON objects into Java objects, making it easier to manipulate and access the document data;
• Iterator Implementation: the parser also implements the Iterator interface to go through the documents in the corpus. This allows efficient, sequential processing of one document at a time, without loading the entire corpus into memory.

3.3. Analyzing

Once the parsing of the documents has been set up, the very next step is analyzing them, which in general consists of:

• Tokenization: we split the documents into tokens, which are going to be our "unit" of computation in the system;
• Stopword removal: a predefined list of words considered to be useless in the context of search is removed. An example of such words is articles: they appear in every document, so they do not help discriminate between documents;
• Stemming: we reduce words to their root or base form to improve search results by capturing variations of the same word.

These are the techniques that have been used in at least one of our experiments (a sketch of how such a chain can be assembled follows the list):

• StandardTokenizer
• StopFilter
• ICUFoldingFilter
• LengthFilter
• SnowballFilter
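To make the pipeline concrete, the sketch below shows how such a chain of filters can be assembled in Lucene. It is only an illustration of the general pipeline, not our exact analyzer: the stoplist and the Snowball language are placeholders, and the language-specific filters listed next (as well as ICUFoldingFilter, from the lucene-analysis-icu module) would be inserted in the same way.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/* Minimal sketch of an analysis chain: tokenize, lowercase, remove stopwords,
   keep tokens of 2 to 20 characters and apply the Snowball stemmer.
   The stoplist and the stemmer language are placeholders, not our exact setup. */
public class SketchAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();                               // StandardTokenizer
        TokenStream stream = new LowerCaseFilter(source);                         // lowercase all tokens
        stream = new StopFilter(stream, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);  // StopFilter with a stoplist
        stream = new LengthFilter(stream, 2, 20);                                 // LengthFilter (2-20 characters)
        stream = new SnowballFilter(stream, "English");                           // SnowballFilter
        return new TokenStreamComponents(source, stream);
    }
}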
While the techniques listed above apply to both the English and the French collections, some language-specific filters have also been used.

English:
• EnglishPossessiveFilter
• KStemFilter

French:
• FrenchLightStemmer
• ElisionFilter

3.4. Indexing

Once the document analysis is completed, the next step is indexing. It is a crucial phase, since it creates a searchable database, the index, that contains essential metadata about the parsed documents. This metadata includes details like the words and phrases within each document, their frequency, and their location within the document. By structuring documents in this manner, we improve retrieval efficiency, enabling users to search for documents based on keywords or phrases with ease. In order to do that, we indexed the parsed documents through an inverted index, using two different fields:

• DocNo
• Content

Every parsed document is thus stored in the index with these two fields, one used to identify the document itself and the other representing its whole content.

3.5. Searching

The Searcher is the component responsible for interpreting input queries and searching through the indexed documents to identify those that fit the query best ("best match"). It then retrieves these documents and presents them back to the user.

3.5.1. BM25

The next step is to fetch the pertinent documents based on the given queries: this involves identifying the documents most similar to our queries using various scoring functions. We assign scores to each document in our collection, ranking them from highest to lowest. The highest-ranked document is presumed to be the most relevant to the given query. The BM25 ranking function, which belongs to the "BM family" of retrieval models (BM stands for Best Match), in addition to being simple and effective, is very competitive compared to more modern techniques [6]. In Section 5 we used k1 = 1.2 and b = 0.75 as parameters for BM25 (the default ones from Apache Lucene).

3.5.2. Queries

Queries are the bridge between user information needs and the underlying document corpus. LongEval provides query datasets in TSV (Tab-Separated Values) format, structured to include a query identifier (num) and the corresponding query text (text). Each line in the TSV file represents a single query, with the query identifier and text separated by a tab character. An example from the LongEval 2024 Test Collection [7] follows:

q062228	aeroport bordeaux

Once the TSV queries are loaded and parsed, they are transformed into an object that stores num and text. This object is then submitted to the index searcher for the retrieval of relevant documents, generating ranked lists of documents based on their relevance to each query. Before submitting them to the actual searcher, queries are parsed using the Lucene Query Parser [8], since this package also provides many powerful tools to modify query terms and implement strategies like fuzzy search, proximity search and term boosting.

3.5.3. Proximity Search

In order to improve the performance of our IR system, one possibility is to use proximity search, which allows us to search for a document based on how closely two or more search terms of the query appear in the document. The distance between the terms is given by a parameter k, which depends on the context and the length of the documents. For example, the query "red brick house" could be used to retrieve documents that contain phrases like "red house of brick" or "house made of red brick", while avoiding documents where the words are scattered far apart. Later, we will discuss the improvements given by this kind of search.
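As an illustration, the fragment below builds such a proximity query through Lucene's classic query parser; the field name and the distance value are only examples and do not necessarily match our configuration.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class ProximityQueryExample {
    public static void main(String[] args) throws ParseException {
        // Quoted terms followed by ~k are parsed as a phrase query with slop k,
        // i.e. the terms have to occur close to each other (within k positions).
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());  // field name is an example
        Query proximity = parser.parse("\"red brick house\"~50");
        System.out.println(proximity);  // prints something like: contents:"red brick house"~50
    }
}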
3.5.4. Synonyms

Another way to improve the performance of our IR system is to use query expansion. Query expansion is a technique that consists of reformulating the queries to better match relevant documents. There are several ways to perform it; the one we tried was synonym query expansion. With this approach, each query term is expanded with its own synonyms. To find the synonyms of every English word we used WordNet, a large lexical database of English containing words (nouns, verbs, adjectives, ...) grouped into sets of cognitive synonyms [10]. The same methodology was also applied to other European languages in the EuroWordNet project. A list of words with associated synonyms, retrieved from these databases, is available in our repository for both English and French. This approach does not always lead to improvements; indeed, in some cases it worsens system performance. The results of our experiments are shown in Section 5.

3.5.5. Reranking

After the searching process has retrieved the highest-ranked documents with respect to the employed criteria, we can apply a second phase of ranking. This phase is not going to look through all the documents again; instead, it works with the documents the first phase retrieved. In our case, we will take only the first k documents for efficiency reasons. The value of k will be discussed later. Figure 3 shows the process flow of the reranking phase.

Figure 3: Reranking process flow

The reranking approach we use is based on machine learning, specifically on sentence embedding models. Essentially, we employ a pre-trained model that maps text to a vector in a multidimensional space. For each one of the topics we are processing, we compute its vector. Then we take the first k documents retrieved by Lucene's searcher and compute their vectors. We then compute a score for each match using the dot product. The resulting value is used to rerank the documents retrieved by the search by increasing their score accordingly. For each one of the k documents, its updated score is computed as follows:

score_i ← score_i + R · sim(emb(q), emb(d_i)),

where R is a coefficient used to decide how much to value the output of the sentence embedder; sim is the function that computes the similarity between the two vectors (in our system we use the one provided by the sentence_transformers Python package [11]); and emb computes the embedding of a piece of text, here used to obtain the vector of the query and of the i-th document. This process is applied to all the queries.
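A minimal sketch of this score update is shown below. It assumes a helper method, here called embeddingSimilarity, that returns the dot product between the embeddings of the query and of a document (in our system this value is produced by the sentence embedder exposed through an HTTP service, as described later in this section); the method name and the field name are illustrative, not our exact implementation.

import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;

public class RerankerSketch {

    /* Updates the first k hits as score_i <- score_i + R * sim(emb(q), emb(d_i))
       and re-sorts only those k documents; the remaining N-k hits are left untouched. */
    public static void rerank(ScoreDoc[] hits, String queryText, IndexSearcher searcher,
                              int k, float r) throws IOException {
        int top = Math.min(k, hits.length);
        for (int i = 0; i < top; i++) {
            Document doc = searcher.storedFields().document(hits[i].doc);
            float sim = embeddingSimilarity(queryText, doc.get("contents"));  // dot product of the two embeddings
            hits[i].score += r * sim;
        }
        Arrays.sort(hits, 0, top, (a, b) -> Float.compare(b.score, a.score)); // descending by updated score
    }

    /* Placeholder: in our system this value comes from the sentence embedder exposed
       through an HTTP service; the name and signature of this helper are hypothetical. */
    private static float embeddingSimilarity(String query, String documentText) {
        throw new UnsupportedOperationException("call the embedding service here");
    }
}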
The model we used is all-mpnet-base-v2 and it is freely available from HuggingFace [12]. It is trained mostly on English data, from A Repository of Conversational Datasets by Henderson et al. [13], so we expect a greater increase in performance when using it on the translated documents. As we will see, this is the case. The architecture of the model is based on Microsoft's MPNet, which stands for Masked and Permuted Language Modeling [14]. MPNet combines the best parts of BERT and XLNet. It uses a special training procedure [11] that mixes up the order of words to learn their connections while still understanding the context from both directions, like BERT does. This hybrid approach allows MPNet to learn contextual representations more effectively, resulting in improved performance on various natural language understanding tasks.

The model is formed by 12 transformer layers, each having 768 hidden units and 12 attention heads. Given its training on extensive English datasets from Henderson et al.'s repository, all-mpnet-base-v2 excels in tasks involving English text and, as demonstrated, it shows significant improvements when applied to the translated documents. As we will see in Section 5, it allows us to improve our information retrieval system on the original French documents as well.

As a similarity function we used the dot_score function from the sentence_transformers package. This function computes the dot product between the two vectors [11].

Figure 4 illustrates the reranking algorithm that has been implemented, detailing the number of documents after each phase. To clarify, N represents the number of top-ranked documents retrieved in the initial search phase, while k has already been introduced. As can be seen from the diagram, only the scores of the first k documents are updated, while the remaining N-k documents are not reranked.

Figure 4: Reranking procedure with the number of documents after each phase

As previously said, to interact with the model we used the sentence-transformers library in Python, and in order to interact with Python from the Java code we use a Flask HTTP server [15]. We then use HTTP requests from the Java code to get the score of a document with respect to a query. Although this adds a small overhead, it was much simpler than interacting with the sentence embedder directly from Java. All the experiments concerning the reranker are available in Section 5.4.

3.5.6. N-gram word model

This section briefly explains another technique for improving SE results, the N-gram word model. The idea behind it is that phrases are particularly significant in the IR field. Specifically, statistics show that most two or three-word queries are phrases. Because of this, it is important to consider multiple words as phrases rather than as independent words. However, the impact of using phrases can be complex, so it is essential to be careful when using them. Let us consider the following definition: a phrase is any sequence of n words, also known as an n-gram. Sequences of two words are called bigrams, while sequences of three words are called trigrams. The greater the frequency of occurrence of a word n-gram, the higher the probability that it corresponds to a meaningful phrase in the language. In the context of IR, n-grams are used to index and retrieve documents based on user queries [16].
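In Lucene, word n-grams of this kind can be generated by appending a ShingleFilter to the analysis chain. The sketch below shows how bigrams and trigrams (the lengths used in our Shingles runs) can be emitted alongside the original tokens; it is an illustration, not our exact configuration.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/* Sketch: emit single words plus word bigrams and trigrams ("shingles"). */
public class ShingleSketchAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        ShingleFilter shingles = new ShingleFilter(stream, 2, 3);  // n-grams of length 2 and 3
        shingles.setOutputUnigrams(true);                          // keep the original single terms as well
        return new TokenStreamComponents(source, shingles);
    }
}

With this setup, the text "red brick house" yields the tokens red, brick, house, "red brick", "brick house" and "red brick house".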
4. Experimental Setup

The setup used to run the experiments is the following:

• The system has been run on the collections available from clef-longeval.github.io/data/. All the evaluation has been performed on the 2024 Training Set;
• To evaluate the performance of the SE we used trec_eval 9.0.7, available at trec.nist.gov/trec_eval/;
• The repository with the code can be found at bitbucket.org/upd-dei-stud-prj/seupd2324-dam/.

We ran most of the experiments on the following hardware:

• Intel i7 6700K
• MSI GTX 970
• 16 GB DDR4 memory

For reproducibility reasons, the runs were also submitted to TIRA [17]. Since some of the experiments utilize the reranker, the container is given access to a GTX 1080 video card.

4.1. TIRA

TIRA is a key platform for research in information retrieval, designed to facilitate blinded and reproducible experiments. Generally, studies in this field suffer from a lack of reproducibility, since typically only test collections and research papers are shared, requiring third parties to rebuild the software to test new datasets. To address this, TIRA has been upgraded to ease task setup and software submission, scaling efficiently from local setups to cloud-based systems using parallel CPU and GPU processing. Overall, TIRA enhances the conduct of AI experiments, ensuring both secrecy and repeatability, and improving the reliability and progression of research in information retrieval. For information retrieval in particular, TIREx (the Information Retrieval Experiment Platform) [18] has been developed. It integrates ir_datasets, ir_measures, and PyTerrier [19] with TIRA to promote more standardized, reproducible, scalable, and even blinded retrieval experiments.

5. Results and Discussion

Before exploring the results our system is able to obtain, it is useful to list all the possible components we can make use of and assign each of them a keyword. This keyword will be used in the name of a run to identify it.

• FR: the indexing and searching process was done on the French version of the documents.
• EN: the indexing and searching process was done on the English version of the documents.
• Snowball: the Snowball stemmer was used.
• Krovetz: the Krovetz stemmer was used.
• FrenchLight: the FrenchLight stemmer was used.
• Poss: the English possessive filter was used.
• Elision: the ElisionFilter was used.
• Stop: a list of stopwords was used. For the English documents we used a default one provided by Lucene. For the French documents we used a custom list that is available in our repository. It is important to note that for the French documents we used both lists, because those documents contain some paragraphs in English.
• ICU: ICU folding was used.
• Prox: proximity search was used. If this keyword is present, we also write the distance parameter, as explained in Section 3.5.3. For example, if proximity search with a distance parameter of 50 is employed, we write Prox(50).
• Reranking: reranking was used. This also comes with a parameter: the k value we discussed in Section 3.5.5. The R parameter instead takes a fixed value of 5. As an example, if the system reranks the first 50 documents, we write Reranking(50).
• Syns: the query expansion technique using synonyms was used.
• Shingles: word N-grams are being used. In our case we generate n-grams of length 2 and 3.

All runs also employ a filter that discards all tokens shorter than 2 characters or longer than 20. Other than that, by default, all tokens are transformed into lowercase. All the experiments have been run using BM25 with default parameters as the ranking function.

The two main metrics we are going to use to compare runs are the Normalized Discounted Cumulative Gain (nDCG) and the Mean Average Precision (MAP). The reason why we chose these metrics is that they are widely used in IR tasks, since they offer a comprehensive understanding of the effectiveness of retrieval systems. Additionally, they are simple and easy to understand and interpret. nDCG measures how close the system's output is to an ideal run, where all the items are sorted in decreasing order of relevance. Moreover, since nDCG is a normalized metric, it enables fair comparisons between lists of varying lengths and relevance distributions [20].
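For reference, a common formulation of the metric at cutoff n (the exact gain function may differ across implementations such as trec_eval) is:

\mathrm{DCG}@n = \sum_{i=1}^{n} \frac{rel_i}{\log_2(i+1)}, \qquad \mathrm{nDCG}@n = \frac{\mathrm{DCG}@n}{\mathrm{IDCG}@n},

where rel_i is the relevance of the document at rank i and IDCG@n is the DCG@n of the ideal ranking mentioned above.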
MAP, on the other hand, is the mean over all queries of the average precision (AP), which considers all the relevant documents of a query, giving insight into the system's overall precision and recall. MAP decreases more rapidly if there are non-relevant items at the top of the ranking [21]. By calculating both of the previously mentioned metrics, we can assess the system's performance from different perspectives.

5.1. Baseline Lucene performance

To set a baseline for our next runs, we ran a baseline Lucene configuration on both the French and the English documents. This baseline configuration just uses the standard tokenizer, the Snowball stemmer, the lowercase filter and BM25 as the retrieval model. The results we obtained are shown in Table 1.

Table 1
Baseline performance

Language   nDCG     MAP
French     0.4563   0.2553
English    0.3681   0.1842

As we can see, the performance on the "original" French dataset is better than that on the translated English dataset: nDCG and MAP are 23.96% and 38.59% higher, respectively. This could be explained by the fact that the automated translation is not very accurate and some information is lost in the process.

5.2. Choice of stemmer

As described in Section 3, there are several stemmers that can be used to derive word roots. We made some runs to decide the best one for English and French, respectively. For both collections, the corresponding stoplists were used to conduct the experiments.

Table 2
Effect of stemmers on the French collection

Stemmer       nDCG     MAP
SnowBall      0.4566   0.2573
FrenchLight   0.4633   0.2626

The FrenchLight stemmer, compared with SnowBall, improves the performance of our system, increasing nDCG and MAP by 1.47% and 2.06%, respectively.

Table 3
Effect of stemmers on the English collection

Stemmer    nDCG     MAP
SnowBall   0.3729   0.1881
Krovetz    0.3661   0.1823

In this case, the SnowBall stemmer outperforms Krovetz: nDCG and MAP are higher by 1.86% and 3.18%, respectively.

5.3. Synonyms

In this section we discuss synonym query expansion. We tried this approach both for English and French. The experimental setup was:

• English stoplist, possessive filter and SnowBall stemmer for the English collection;
• French stoplist, elision filter, ICU filter and FrenchLight stemmer for the French collection.

The synonyms included in the queries are analyzed with the same analyzer used for indexing the documents and, furthermore, are assigned a weight of 0.7 to reduce their significance compared to the words originally present in the queries (a sketch of this weighting is given at the end of this section). The results are reported in Tables 4 and 5.

Table 4
Effect of synonym query expansion on the English collection

Synonyms   nDCG     MAP
Yes        0.3605   0.1798
No         0.3728   0.1880

Table 5
Effect of synonym query expansion on the French collection

Synonyms   nDCG     MAP
Yes        0.4648   0.2581
No         0.4776   0.2733

As we can see, in both cases synonym query expansion worsened the performance of our information retrieval system. One of the reasons could be that synonyms unrelated to a query are still included in it because, for example, they share the same root. Therefore, in the following experiments we completely discard synonyms and focus on other methods to improve the effectiveness of our system.
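For completeness, the fragment below sketches how this kind of weighted synonym expansion can be expressed with Lucene query clauses, with the synonym down-weighted by the 0.7 factor mentioned above; the terms and the field name are illustrative and not taken from our runs.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SynonymExpansionSketch {
    public static void main(String[] args) {
        // Original query term plus one synonym; the synonym is down-weighted to 0.7
        // so that the original wording keeps a larger influence on the final score.
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        builder.add(new TermQuery(new Term("contents", "airport")), BooleanClause.Occur.SHOULD);
        builder.add(new BoostQuery(new TermQuery(new Term("contents", "airfield")), 0.7f),
                BooleanClause.Occur.SHOULD);
        Query expanded = builder.build();
        System.out.println(expanded);  // prints something like: contents:airport (contents:airfield)^0.7
    }
}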
5.4. Choice of the reranker parameter

As covered in Section 3.5.5, we need to decide a threshold k for the number of documents that undergo reranking, so we made some runs to find the best value. Since we use the reranker for experiments on both the English and the French documents, we tried it on both collections. In addition to the techniques used in the previous experiment, we also added a stopword list. For the French collection, the stopword list, ICU folding and the elision filter have been used; the runs on the English collection made use of the English possessive filter.

Table 6
Effect of the reranker on the French collection

k     nDCG     MAP
0     0.4776   0.2733
25    0.4832   0.2883
50    0.4877   0.2942
100   0.4904   0.2976
150   0.4917   0.2986
200   0.4918   0.2985

As we can see, the reranker allows us to improve the performance of the retrieval system. Between the run with no reranking (k = 0) and the one with k = 200, we notice an increase in nDCG of 2.97%. The best value of MAP occurs instead for k = 150, with an increase of 9.25%.

Table 7
Effect of the reranker on the English collection

k     nDCG     MAP
0     0.3728   0.1880
25    0.3898   0.2094
50    0.3947   0.2167
100   0.3970   0.2195
150   0.3982   0.2206
200   0.3990   0.2212

For what concerns the English collection, the best result is obtained with k = 200, with an increase in nDCG and MAP of 7.03% and 17.66%, respectively.

5.5. Training results

In this section we compare the best configurations for various parameters in both the French and the English language. These configurations represent the runs we have chosen to submit to the LongEval Lab:

• The best system for the English language that does not use reranking;
• The best system for the English language, using reranking and the SnowBall stemmer;
• The best system for the French language that does not use reranking;
• The best overall system, i.e. the system that achieved the best results using reranking and the other parameters discussed in the previous sections;
• A system using word N-grams.

The results of these systems are displayed in Table 8.

Table 8
Performance of the submitted configurations on the training dataset

Label      System                                                             nDCG     MAP
System 1   EN-Stop-SnowBall-Poss-Prox(50)                                     0.3820   0.1969
System 2   EN-Stop-SnowBall-Poss-Prox(50)-Reranking(200)                      0.3963   0.2147
System 3   FR-Stop-FrenchLight-Elision-ICU-Prox(50)                           0.4914   0.2925
System 4   FR-Stop-FrenchLight-Elision-ICU-Prox(50)-Reranking(150)            0.5024   0.3079
System 5   FR-Stop-FrenchLight-Elision-ICU-Shingles-Prox(50)-Reranking(150)   0.4774   0.2758

A first notable observation is that our search engine generally performs better on the French documents. As we previously noted, this trend may be attributed to the fact that the original corpus was written in French and later translated into English. A further analysis of the results reveals that reranking leads to the best performance for both languages. However, the introduction of the N-gram (Shingles) technique significantly decreases the search engine's performance, nullifying the benefits of reranking and resulting in an even worse outcome compared to the best-performing configuration without this approach.

Figure 5 shows the interpolated precision-recall curves of the systems described in Table 8. This type of plot is useful to represent the trade-off between two different measures, precision and recall, and allows us to compare different systems and figure out which one is better than the others. The System 5 curve intersects the System 3 curve at recall 0.2: hence, System 5 performs better at the top of the ranking (low recall levels) and worse further down (high recall levels). Regarding the other systems, the curves never intersect, so their performances are more clearly separated. According to the results reported in Table 8, System 4 is the best.

Figure 5: Interpolated precision-recall curves on the training dataset
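For clarity, the interpolated precision plotted at each recall level r in Figure 5 is the standard one, i.e. the maximum precision observed at any recall level greater than or equal to r:

P_{\mathrm{interp}}(r) = \max_{r' \ge r} P(r'),

which makes the curves monotonically non-increasing.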
5.6. Test results

To assess how well the submitted runs perform, we analyzed the nDCG and MAP measures and also performed Statistical Hypothesis Testing (SHT). The analysis was done for each of the five submitted runs, for both the Short-term and the Long-term collections. Short-term represents the collection from June 2023, while Long-term represents the collection from August 2023. This kind of analysis is essential as it shows how well each IR system performs. Moreover, it is particularly significant to compare these results, since the purpose of the LongEval task is to find an IR system that can handle changes over time, meaning that we are interested in systems whose measures drop as little as possible over time. Table 9 shows the performance drop of the systems between June and August.

Table 9
Comparison between June and August

                                                                               nDCG               MAP
Label      System                                                             June     August    June     August
System 1   EN-Stop-SnowBall-Poss-Prox(50)                                     0.2938   0.2205    0.1564   0.1116
System 2   EN-Stop-SnowBall-Poss-Prox(50)-Reranking(200)                      0.3039   0.2308    0.1686   0.1214
System 3   FR-Stop-FrenchLight-Elision-ICU-Prox(50)                           0.3849   0.2855    0.2351   0.1615
System 4   FR-Stop-FrenchLight-Elision-ICU-Prox(50)-Reranking(150)            0.3964   0.2942    0.2489   0.1709
System 5   FR-Stop-FrenchLight-Elision-ICU-Shingles-Prox(50)-Reranking(150)   0.3701   0.2794    0.2204   0.1564

Additionally, SHT was performed to determine whether the systems' performance is statistically different or not. This is important to understand whether some configurations actually improve the system, for example, whether the usage of reranking has a real effect on the performance or the mean increase is due only to the variance of the test. In this section, we will further explain SHT, present the results for the previously mentioned metrics as well as for SHT, and provide our conclusions about them.

5.6.1. Statistical Hypothesis Testing

SHT is a type of statistical analysis used to estimate the relationship between statistical variables. Later in this section, the results produced by performing SHT on both the Short-term and the Long-term datasets will be shown and explained. SHT is important for determining, in a scientifically valid way, whether our systems are performing similarly or differently. In other words, we are interested in knowing whether there is a statistically significant difference between them. In order to perform SHT, two mutually exclusive hypotheses H0 and H1 must be defined: H0 is called the null hypothesis, while H1 is the alternative hypothesis. Besides these hypotheses, a threshold α must be defined, representing a significance level. For example, α = 0.05 means there is a 5% probability of wrongly declaring that the systems are different. SHT uses sample data to determine whether H0 can be rejected; if that is the case, we conclude in favor of the alternative hypothesis H1 [22].

In particular, we will use Two-Way Analysis of Variance (ANOVA2) as the statistical test. It examines the influence of two different variables, which are, in our case, the systems and the topics. ANOVA2 is used to evaluate the difference between the means of more than two groups, which is useful in our case since we have 5 different IR systems; the quantities it computes are recalled after the hypotheses below. In ANOVA2 the hypotheses are as follows:

• H0: the means of all groups are equal;
• H1: at least 2 groups have different means [23].
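For reference, in a two-way ANOVA without replication (one score per system-topic pair) the total variability is decomposed as

SS_{\mathrm{total}} = SS_{\mathrm{systems}} + SS_{\mathrm{topics}} + SS_{\mathrm{error}},

and the test statistic for the systems factor is F = MS_{\mathrm{systems}} / MS_{\mathrm{error}}, where MS = SS/df and the degrees of freedom are s − 1 for the systems, t − 1 for the topics and (s − 1)(t − 1) for the error term, with s systems and t topics. These are the quantities reported in the ANOVA tables that follow.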
5.6.2. Short-term

In this section, the results for the nDCG and AP measures, as well as for SHT, are presented for the Short-term collection, together with our conclusions. Table 9 displays the nDCG and MAP measures of the five submitted systems for the Short-term collection. It is evident that the system that performed best on the training data remains the best one, based on both measures: FR-Stop-FrenchLight-Elision-ICU-Prox(50)-Reranking(150). However, there is a notable performance drop of 21.12% for nDCG and 19.16% for MAP. This drop was expected, as user queries and preferences have evolved over time. Notably, the systems with English configurations experience the largest drops. This phenomenon could be related to the automated translation process from French to English.

Based on the box plots in Figures 6 and 7, it is notable that the FR systems tend to have higher median values for both measures compared to the EN systems, indicating generally better performance. Additionally, the FR systems do not have as many outliers as the EN systems, which shows a more consistent performance.

Figure 6: Box plots on the Short-term collection for the EN systems ((a) nDCG, (b) AP); the dashed lines represent the mean values and the solid ones the medians

Figure 7: Box plots on the Short-term collection for the FR systems ((a) nDCG, (b) AP); the dashed lines represent the mean values and the solid ones the medians

Table 10 and Table 11 show the results of the ANOVA2 test. The SS column shows the total variability for each source: higher values indicate greater variability. df indicates how many degrees of freedom a source of variation has; for example, since the matrix of the scores contains a system for each column, and 5 systems are being compared, the columns have 4 degrees of freedom. The MS column is the average of the sum of squares (SS) for each source, calculated by dividing SS by its corresponding degrees of freedom (df); it represents the variance for each source. F shows the F-statistic, calculated as the ratio of the mean squares of the source to the mean squares of the error; it is used to determine whether the observed variance between groups is significantly greater than the variance within groups. Finally, the column we are most interested in, Prob>F, indicates the p-value associated with the F-statistic. A lower p-value (in our case less than 0.05) suggests that the differences between group means are statistically significant. In this case, both Table 10 and Table 11 allow us to state that the 5 systems have different performance, with a probability of being wrong very close to 0.

Table 10
ANOVA2, AP

Source    SS         df     MS        F           Prob>F
Systems   2.75807    4      0.68951   95.04877    0
Topics    63.14746   403    0.15669   21.59988    0
Error     11.69404   1612   0.00725
Total     77.59958   2019

Table 11
ANOVA2, nDCG

Source    SS         df     MS        F           Prob>F
Systems   3.71902    4      0.92975   113.77239   0
Topics    70.53095   403    0.17501   21.41620    0
Error     13.17338   1612   0.00817
Total     87.42336   2019

After discovering that the 5 systems are different, it is useful to compare the systems pairwise in order to understand where the difference comes from. For this purpose, we employ Tukey's Honest Significant Difference (HSD) test.

Figure 8: Plot of the differences in means across multiple groups ((a) nDCG, (b) AP), illustrating the variation within and between groups

Figure 8 shows a comparison between the means of the different groups. It is useful to visualize this to understand whether there is a real difference in the performance of the systems. For example, in both plots System 1 was selected: this shows that there is a statistical difference between it and Systems 3, 4 and 5. As for System 2, we cannot say that it is different from System 1 at α = 0.05. The output of the test is reported in Table 12 and Table 13 for AP and nDCG, respectively.
The p-values lower than α = 0.05 are shown in bold, meaning that for those pairs we reject the null hypothesis.

Table 12
Systems comparison, AP

System A   System B   P-value
System 1   System 2   0.24863
System 1   System 3   0
System 1   System 4   0
System 1   System 5   0
System 2   System 3   0
System 2   System 4   0
System 2   System 5   0
System 3   System 4   0.14086
System 3   System 5   0.10402
System 4   System 5   0.00001

Table 13
Systems comparison, nDCG

System A   System B   P-value
System 1   System 2   0.50142
System 1   System 3   0
System 1   System 4   0
System 1   System 5   0
System 2   System 3   0
System 2   System 4   0
System 2   System 5   0
System 3   System 4   0.36529
System 3   System 5   0.13639
System 4   System 5   0.00033

5.6.3. Long-term

In this section, the results for the nDCG and AP measures, as well as for SHT, are presented for the Long-term collection, together with our conclusions. Table 9 shows the nDCG and AP scores of the five systems submitted for the Long-term collection. It is clear that the best-performing system, based on both measures, remains the same: FR-Stop-FrenchLight-Elision-ICU-Prox(50)-Reranking(150). However, there is a significant performance decline of 41.45% for nDCG and 44.47% for MAP between the training and the Long-term data. This decline was expected, given the evolution of user queries and preferences over time.

Based on the box plots in Figures 9 and 10, we can draw conclusions similar to those made when comparing the box plots for the Short-term collection: the FR systems show better performance for both measures compared to the EN systems. Moreover, the FR systems have fewer outliers than the EN systems. When comparing the box plots in Figures 9 and 10 for the Long-term collection with the box plots in Figures 6 and 7 for the Short-term collection, we can conclude the following: all five IR systems show a slight decrease in the median nDCG and AP values over time, which was expected. Another notable observation is that the box plots for the Long-term collection have more outliers than those for the Short-term collection. This behavior could be explained by various factors, such as the impact of different queries and documents and changes in user behavior over time.

In the same fashion as for the Short-term dataset in Section 5.6.2, we ran the ANOVA2 test and the pairwise comparison using the HSD test. Table 14 and Table 15 show the results of the ANOVA2 test. The output of the systems comparison is reported in Table 16 and Table 17 for AP and nDCG, respectively. The p-values lower than α = 0.05 are shown in bold, meaning that for those pairs we reject the null hypothesis. One notable observation is that, on these datasets, all pairs of systems, except for System 3 and System 5, are statistically different.
Figure 9: Box plots on the Long-term collection for the EN systems ((a) nDCG, (b) AP); the dashed lines represent the mean values and the solid ones the medians

Figure 10: Box plots on the Long-term collection for the FR systems ((a) nDCG, (b) AP); the dashed lines represent the mean values and the solid ones the medians

Table 14
ANOVA2, AP

Source    SS          df     MS        F           Prob>F
Systems   4.17251     4      1.04312   212.17106   0
Topics    144.32133   1515   0.09526   19.37610    0
Error     29.79367    6060   0.00491
Total     178.28752   7579

Table 15
ANOVA2, nDCG

Source    SS          df     MS        F           Prob>F
Systems   6.96973     4      1.74243   269.51048   0
Topics    209.55709   1515   0.13832   21.39485    0
Error     39.17897    6060   0.00646
Total     255.70580   7579

Figure 11: Plot of the differences in means across multiple groups ((a) nDCG, (b) AP), illustrating the variation within and between groups

Table 16
Systems comparison, AP

System A   System B   P-value
System 1   System 2   0.00115
System 1   System 3   0
System 1   System 4   0
System 1   System 5   0
System 2   System 3   0
System 2   System 4   0
System 2   System 5   0
System 3   System 4   0.00221
System 3   System 5   0.25629
System 4   System 5   0

Table 17
Systems comparison, nDCG

System A   System B   P-value
System 1   System 2   0.00419
System 1   System 3   0
System 1   System 4   0
System 1   System 5   0
System 2   System 3   0
System 2   System 4   0
System 2   System 5   0
System 3   System 4   0.02270
System 3   System 5   0.22570
System 4   System 5   0

6. Conclusions and Future Work

In this section we summarize the main achievements and the conclusions we reached during the development of the SE. Firstly, using a basic configuration (with the standard tokenizer, the Snowball stemmer, and the lowercase filter), we observed better performance on the French dataset compared to the English dataset. We can explain this behavior with the fact that the translation to English is automated. Secondly, the default performance shown in Section 5 is quite good, but after experimenting with different configurations, we managed to find better parameters. It is important to mention that we always used the appropriate stopword list, as we noticed an improvement in performance when utilizing it. Moreover, the SE performed better when using an appropriate stemmer, while the usage of synonyms had detrimental effects on it. Another technique that brought significant progress was the usage of a reranker for both languages.

While analyzing the performance drops and the SHT results for both the Short-term and the Long-term collections, we arrived at the following conclusions. There is a significant difference in performance between the systems with French and English configurations: the FR systems consistently outperform the EN systems on both datasets. As already mentioned, this is related to the automated translation from French to English and to the fact that the language evolves over time [24]. This demonstrates that language-specific optimizations play a vital role in the effectiveness of retrieval systems. Moreover, systems with reranking generally perform better than non-reranking systems. Furthermore, the performance drop over time is evident, and it highlights the need for continuous updates to maintain performance over time.

Regarding future work, we could try to improve query expansion with the use of Large Language Models (LLMs), which have shown remarkable capabilities in the IR field, particularly in text understanding [25]. Indeed, in order to better perceive the user's intent and create a more effective query, we could reformulate it.
For this purpose, we could use a language model to rephrase the query and hopefully increase the performance of the SE. These models can grasp the context and meaning of text, enabling more accurate retrieval of relevant documents [26]. Another idea that could be beneficial to the performance of our system would be to use a sentence embedder model trained specifically on French data. This could hypothetically increase even further the performance boost obtained with the use of the reranker.

References

[1] S. Chavhan, M. Raghuwanshi, R. Dharmik, Information Retrieval using Machine Learning for Ranking: A Review, https://iopscience.iop.org/article/10.1088/1742-6596/1913/1/012150/meta, 2021. [Online; accessed: 2024-05-22].
[2] R. Nogueira, W. Yang, K. Cho, J. Lin, Multi-stage document ranking with BERT, 2019. URL: https://arxiv.org/abs/1910.14424. arXiv:1910.14424.
[3] E. Bolzonello, C. Marchiori, D. Moschetta, R. Trevisiol, F. Zanini, N. Ferro, et al., SEUPD@CLEF: Team FADERIC on a query expansion and reranking approach for the LongEval task, in: CEUR Workshop Proceedings, volume 3497, CEUR-WS, 2023, pp. 2252–2280.
[4] Apache Lucene Website, https://lucene.apache.org/, 2024. [Online; accessed: 2024-05-30].
[5] Jackson GitHub Page, https://github.com/FasterXML/jackson, 2024. [Online; accessed: 2024-06-01].
[6] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: BM25 and beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333–389.
[7] LongEval 2024 Test Collection, https://doi.org/10.48436/xr350-79683, 2024. [Online; accessed: 2024-06-01].
[8] Lucene Query Parser Documentation, https://lucene.apache.org/core/9_10_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html, 2024. [Online; accessed: 2024-06-02].
[9] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press, Cambridge, MA, 1998.
[10] G. A. Miller, WordNet: A lexical database for English, Communications of the ACM 38 (1995) 39–41.
[11] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, https://arxiv.org/abs/1908.10084, 2019. [Online; accessed: 2024-05-05].
[12] all-mpnet-base-v2 Documentation, https://huggingface.co/sentence-transformers/all-mpnet-base-v2, 2024. [Online; accessed: 2024-06-05].
[13] M. Henderson, P. Budzianowski, I. Casanueva, S. Coope, D. Gerz, G. Kumar, N. Mrkšić, G. Spithourakis, P.-H. Su, I. Vulic, T.-H. Wen, A repository of conversational datasets, in: Proceedings of the Workshop on NLP for Conversational AI, 2019. URL: https://arxiv.org/abs/1904.06472, data available at github.com/PolyAI-LDN/conversational-datasets.
[14] K. Song, X. Tan, T. Qin, J. Lu, T.-Y. Liu, MPNet: Masked and Permuted Pre-training for Language Understanding, https://arxiv.org/abs/2004.09297, 2020. [Online; accessed: 2024-05-22].
[15] Flask Documentation, https://flask.palletsprojects.com/en/3.0.x/, 2024. [Online; accessed: 2024-06-01].
[16] W. B. Croft, D. Metzler, T. Strohman, Search Engines - Information Retrieval in Practice, https://ciir.cs.umass.edu/irbook/, 2015. [Online; accessed: 2024-05-01].
[17] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval.
45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. doi:10.1007/978-3-031-28241-6_20.
[18] M. Fröbe, J. Reimer, S. MacAvaney, N. Deckers, S. Reich, J. Bevendorff, B. Stein, M. Hagen, M. Potthast, The Information Retrieval Experiment Platform, in: H. Chen, W. E. Duh, H. Huang, M. P. Kato, J. Mothe, B. Poblete (Eds.), 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), ACM, 2023, pp. 2826–2836. doi:10.1145/3539618.3591888.
[19] C. Macdonald, N. Tonellotto, S. MacAvaney, I. Ounis, PyTerrier: Declarative experimentation in Python from BM25 to dense retrieval, in: Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM 2021), 2021, pp. 4526–4533.
[20] A. Dhinakaran, Demystifying NDCG, https://towardsdatascience.com/demystifying-ndcg-bee3be58cfe0, 2023. [Online; accessed: 2024-05-02].
[21] R. J. Tan, Breaking Down Mean Average Precision (mAP), https://towardsdatascience.com/breaking-down-mean-average-precision-map-ae462f623a52, 2019. [Online; accessed: 2024-05-02].
[22] C. Majaski, Hypothesis Testing: 4 Steps and Example, https://www.investopedia.com/terms/h/hypothesistesting.asp, 2024. [Online; accessed: 2024-05-15].
[23] W. Kenton, What Is Analysis of Variance (ANOVA)?, https://www.investopedia.com/terms/a/anova.asp, 2024. [Online; accessed: 2024-05-15].
[24] R. Alkhalifa, E. Kochkina, A. Zubiaga, Building for tomorrow: Assessing the temporal persistence of text classifiers, https://arxiv.org/abs/2205.05435, 2022. [Online; accessed: 2024-05-22].
[25] Y. Zhu, H. Yuan, S. Wang, J. Liu, W. Liu, C. Deng, H. Chen, Z. Dou, J.-R. Wen, Large Language Models for Information Retrieval: A Survey, https://arxiv.org/pdf/2308.07107, 2024. [Online; accessed: 2024-05-04].
[26] V. Gupta, A. Dixit, S. Sethi, An Improved Sentence Embeddings based Information Retrieval Technique using Query Reformulation, https://ieeexplore.ieee.org/document/10141788, 2023. [Online; accessed: 2024-05-04].