<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SEUPD@CLEF: Team DAM on Reranking Using Sentence Embedders</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alberto Basaglia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Stocco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Milica Popović</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This report gives an overview of the system developed by Team DAM for Task 1 of the LongEval Lab at CLEF 2024. The team members are students enrolled in the Computer Engineering master's program at the University of Padua. The team developed an information retrieval system which is then used to perform queries on a corpus of documents in the French language, or on their translated English version. Nowadays, online searching for all types of information has become part of people's daily routines. Billions of users worldwide expect to find the needed information quickly and accurately. A search engine (SE) is software that helps people satisfy such a need, using queries to express an information need. Since the number of web pages has been rapidly increasing, this type of software faces considerable challenges. One of the main challenges is the variability of the performance of the system over time. That is why the LongEval Lab, organized by the Conference and Labs of the Evaluation Forum (CLEF), aims to address this problem by encouraging participants to develop information retrieval (IR) systems that can adapt to the evolution of the corpus over time. The paper is organized as follows: Section 3 describes our approach; Section 4 explains our experimental setup; Section 5 discusses our main findings; finally, Section 6 draws some conclusions and outlooks for future work.</p>
      </abstract>
      <kwd-group>
        <kwd>CLEF</kwd>
        <kwd>LongEval 2024</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Search Engine</kwd>
        <kwd>Documents Retrieval</kwd>
        <kwd>Temporal Persistence</kwd>
        <kwd>Reranking</kwd>
        <kwd>Word Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        Sentence Embedders have been extensively used for the reranking phase of information retrieval systems
for many years [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Recent research has continued to demonstrate the effectiveness of reranking
approaches. For instance, Bolzonello et al. (2023) successfully utilized a reranking-based approach,
further validating its efficacy in enhancing retrieval performance [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
          Our approach is based on the usage of finely tuned off-the-shelf components provided by Apache
Lucene. In addition to those, a reranking phase based on a Sentence Embedder has been implemented.
Our idea was to use a model fine-tuned mostly on online data (Reddit comments, citation pairs and
WikiAnswers, just to name a few) to try to encode the meaning of topics and documents in an effective
way. Furthermore, we tried to improve on one of the works from the previous year, which used reranking to
improve the IR system performance (Enrico Bolzonello et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
          ]). We used a similar approach based on
reranking a small chunk of documents, but using different sentence embedders and a different analyzer
pipeline.
      </p>
      <p>In this section we will cover the methodology that has been used to develop our IR system. In
order to better understand the complete IR system we developed, all of the system's components will be
explained using the diagram in Figure 1. The diagram shows the main components of a SE, as well as
the differentiation between offline and online components.</p>
      <sec id="sec-3-1">
        <title>3.1. Apache Lucene</title>
        <p>To develop our IR system, we used Apache Lucene version 9.10.0, downloaded from https://lucene.apache.org/core/downloads.html.</p>
        <p>
          The Apache Lucene project develops open-source search software. It is a high-performance,
full-featured SE library written entirely in Java. This library provides a robust and scalable set of tools for
developers building efficient IR systems [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Thanks to Apache Lucene, we've been able to handle vast
amounts of documents with ease, using powerful, accurate, and efficient search algorithms. Moreover, its
active community and frequent updates ensure that developers have access to the latest advancements
and optimizations in the field of IR.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Parsing</title>
        <p>First of all, to create a fast and reliable IR system, it is necessary to parse the data we want to run our
queries on. For this task, LongEval releases the corpus of documents in two different formats, TREC
and JSON. We decided to use the JSON files.</p>
        <p>The corpus is divided into several files, each of them containing a JSON array of documents. The
structure of a document is shown in Figure 2.</p>
        <p>
          It was thus necessary to write a parser able to read the documents efficiently from the disk.
In order to create an efficient parser for reading documents from the LongEval corpus in JSON format,
several key components were implemented (a sketch of the pattern is shown after this list):
• File Parser: a file parser is responsible for reading the JSON files containing arrays of documents
efficiently. This component iterates through the file and extracts the JSON objects,
each of which represents a single document;
• Document Model: a document model defines the structure of a document in the corpus. In this
case, each document consists of two fields: docno (document number) and text (document text),
as shown in Figure 2. This model is used to deserialize JSON objects into Java objects during
parsing;
• JSON Deserialization: the parser uses a JSON deserialization library, Jackson [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], to
convert JSON objects into Java objects; this way it is easier to manipulate and access the document
data;
• Iterator Implementation: the parser also implements the Iterator interface to go through the
documents in the corpus. This allows efficient and sequential processing of one document at a
time, without loading the entire corpus into memory.
        </p>
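        <p>A minimal Java sketch of this pattern, using Jackson's streaming API: the ParsedDocument model holds the two fields of Figure 2, and the iterator is an illustrative reconstruction rather than our exact implementation.</p>
        <preformat>
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

// Document model: the two fields shown in Figure 2.
class ParsedDocument {
    public String docno;
    public String text;
}

// Streams documents out of a JSON array one at a time,
// without loading the whole corpus file into memory.
class DocumentIterator implements Iterator&lt;ParsedDocument&gt; {
    private final ObjectMapper mapper = new ObjectMapper();
    private final JsonParser parser;

    DocumentIterator(Reader in) throws IOException {
        parser = new JsonFactory().createParser(in);
        if (parser.nextToken() != JsonToken.START_ARRAY) {
            throw new IOException("Expected a JSON array of documents");
        }
    }

    @Override
    public boolean hasNext() {
        try { // true while the next token opens another document object
            return parser.nextToken() == JsonToken.START_OBJECT;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public ParsedDocument next() {
        try { // deserialize the JSON object at the current position
            return mapper.readValue(parser, ParsedDocument.class);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
        </preformat>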
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Analyzing</title>
        <p>Once the parsing of the documents has been set up, the very next step is analyzing them, which, in
general, consists of:
• Tokenization: we split the documents into tokens, which are going to be our "unit" of computation
in the system;
• Stopword removal: a predefined list of words, considered useless in the context of search,
is removed. An example of these words is articles: they appear in every document, so they do
not help discriminate between documents;
• Stemming: we reduce words to their root or base form to improve search results by capturing
variations of the same word.</p>
        <p>These are the techniques that have been used in at least one of our experiments:
• StandardTokenizer
• StopFilter
• ICUFoldingFilter
• LengthFilter
• SnowballFilter</p>
        <p>While the techniques mentioned above are valid for analyzing both English and French, some
language-specific techniques have also been implemented; they are listed below, and a sketch of a
resulting pipeline follows the lists.</p>
        <sec id="sec-3-3-1">
          <title>English</title>
          <p>• EnglishPossessiveFilter
• KStemFilter</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>French</title>
          <p>• FrenchLightStemmer
• ElisionFilter</p>
        </sec>
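        <p>As an illustration, a custom Lucene Analyzer wiring one of the English chains we experimented with could look as follows. This is a hedged sketch: the exact filter order and stop set of our runs may differ, and the French chain analogously swaps in ElisionFilter, ICUFoldingFilter and FrenchLightStemFilter.</p>
        <preformat>
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.en.KStemFilter;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// One possible English chain: tokenize, lowercase, remove stopwords,
// strip possessives, keep tokens of 2-20 characters, then stem.
class EnglishPipelineAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        stream = new StopFilter(stream, EnglishAnalyzer.getDefaultStopSet());
        stream = new EnglishPossessiveFilter(stream);
        stream = new LengthFilter(stream, 2, 20);
        stream = new KStemFilter(stream); // SnowballFilter is the alternative we tested
        return new TokenStreamComponents(source, stream);
    }
}
        </preformat>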
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Indexing</title>
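        <p>A minimal sketch of this phase, reusing the DocumentIterator of Section 3.2 and the analyzer sketched in Section 3.3; storing the text field is our illustrative assumption (it lets the reranker of Section 3.5.5 re-read the body later).</p>
        <preformat>
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.io.IOException;
import java.nio.file.Path;
import java.util.Iterator;

class Indexer {
    // Writes every parsed document into the inverted index with its two fields.
    static void indexCorpus(Iterator&lt;ParsedDocument&gt; docs, Path indexDir) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(new EnglishPipelineAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), config)) {
            while (docs.hasNext()) {
                ParsedDocument parsed = docs.next();
                Document doc = new Document();
                // docno: identifier, indexed as a single untokenized term
                doc.add(new StringField("docno", parsed.docno, Field.Store.YES));
                // text: the whole content, analyzed by the pipeline above
                doc.add(new TextField("text", parsed.text, Field.Store.YES));
                writer.addDocument(doc);
            }
            writer.commit();
        }
    }
}
        </preformat>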
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Searching</title>
        <p>The Searcher serves as the component responsible for interpreting input queries and searching through
the indexed documents to identify those that best match the query. It then retrieves these
documents and presents them back to the user.</p>
        <p>The next step is to fetch the pertinent documents based on the given queries: this involves identifying
the documents most similar to our queries using various scoring functions. We thus assign a score to each
document in our collection, ranking them from highest to lowest. The highest-ranked document is
presumed to be the most relevant to the given query.</p>
        <sec id="sec-3-5-bm25">
          <title>3.5.1. BM25</title>
          <p>
            The BM25 ranking function, which belongs to the “BM family” of retrieval models (BM stands for Best
Match), in addition to being simple and effective, seems to be very competitive compared to more
modern techniques [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. In Section 5 we used k1 = 1.2 and b = 0.75 as parameters for BM25 (the default
ones from Apache Lucene).
          </p>
        </sec>
        <sec id="sec-3-5-queries">
          <title>3.5.2. Queries</title>
          <p>
            Queries are the bridge between user information needs and the underlying document corpus. LongEval
provides query datasets in TSV (Tab-Separated Values) format, structured to include query identifiers
(num) and the corresponding query text. Each line in the TSV file represents a single query, with
the query identifier and text separated by a tab character. What follows is an example from the LongEval 2024
Test Collection [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]:
          </p>
        <p>q062228 aeroport bordeaux.</p>
          <p>
            Once the TSV queries are loaded and parsed, each one is transformed into an object that stores num and text.
This object is then submitted to the index searcher for retrieval of relevant documents, generating a ranked list
of documents based on their relevance to the query. Before being submitted to the actual searcher,
queries are parsed using the Lucene Query Parser [
          <xref ref-type="bibr" rid="ref8">8</xref>
            ], since this package also provides many powerful tools
to modify query terms and implement strategies like fuzzy search, proximity search and term boosting.
          </p>
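          <p>A condensed sketch of the search step, assuming the index of Section 3.4 and the analyzer of Section 3.3; the run depth of 1000 hits per topic is our illustrative assumption.</p>
          <preformat>
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;
import java.io.IOException;
import java.nio.file.Path;

class Searcher {
    // Runs one topic against the "text" field, scoring with BM25 (k1 = 1.2, b = 0.75).
    static void searchTopic(String topicText, Path indexDir) throws IOException, ParseException {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            searcher.setSimilarity(new BM25Similarity(1.2f, 0.75f));
            QueryParser parser = new QueryParser("text", new EnglishPipelineAnalyzer());
            Query query = parser.parse(QueryParser.escape(topicText)); // escape raw topic text
            ScoreDoc[] hits = searcher.search(query, 1000).scoreDocs;  // top 1000 hits per topic
            for (ScoreDoc hit : hits) {
                System.out.println(searcher.doc(hit.doc).get("docno") + " " + hit.score);
            }
        }
    }
}
          </preformat>
        </sec>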
        <sec id="sec-3-5-1">
          <title>3.5.3. Proximity Search</title>
          <p>In order to improve the performance of our IR system, one possibility is using proximity search,
which allows us to search for a document based on how closely two or more search terms of the query
appear in the document. The distance between the terms is given by a parameter k, which depends
on the context and the length of the documents. For example, the query "red brick house" could be used
to retrieve documents that contain phrases like "red house of brick" or "house made of red brick", while
avoiding documents where the words are scattered across the text. Later, we will
discuss the improvements given by this kind of search.</p>
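          <p>With Lucene's classic query parser, such a constraint can be expressed with the slop operator. A minimal sketch follows; note that the slop counts token positions, and the helper name is ours.</p>
          <preformat>
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

class ProximityQueries {
    // Builds a proximity version of a topic: all quoted terms must occur
    // within the given slop of each other, as in our Prox(50) runs.
    static Query proximityQuery(String topicText, int slop) throws ParseException {
        QueryParser parser = new QueryParser("text", new EnglishPipelineAnalyzer());
        return parser.parse("\"" + QueryParser.escape(topicText) + "\"~" + slop);
    }
}
          </preformat>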
        </sec>
        <sec id="sec-3-5-2">
          <title>3.5.4. Synonyms</title>
          <p>
            Another way to improve the performance of our IR system is query expansion. Query expansion
is a technique that consists in reformulating the queries to better match relevant documents. There
are several ways to perform it; the one we tried was synonym query expansion. With this approach,
each query term is expanded with its own synonyms. To find the synonyms of every English word
we used WordNet, which is a large lexical database of English containing words (nouns, verbs,
adjectives, . . . ) grouped into sets of cognitive synonyms [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. The same methodology was also applied
to the other European languages in the EuroWordNet project. A list of words with associated synonyms,
retrieved from these databases, is available in our repository for both English and French. This approach
does not always lead to improvements; indeed, there are cases in which it leads to worse system performance.
The results of our experiments will be shown in Section 5.
          </p>
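          <p>The following sketch shows one way such an expansion can be assembled with Lucene's query types. The synonym map is assumed to be loaded from the WordNet-derived lists in our repository, and the 0.7 boost anticipates the weight reported in Section 5.3.</p>
          <preformat>
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import java.util.List;
import java.util.Map;

class SynonymExpansion {
    // Each original term is kept at full weight; its synonyms are added
    // as optional clauses, down-weighted so they cannot dominate the query.
    static Query expand(List&lt;String&gt; terms, Map&lt;String, List&lt;String&gt;&gt; synonyms) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String term : terms) {
            builder.add(new TermQuery(new Term("text", term)), BooleanClause.Occur.SHOULD);
            for (String syn : synonyms.getOrDefault(term, List.of())) {
                builder.add(new BoostQuery(new TermQuery(new Term("text", syn)), 0.7f),
                            BooleanClause.Occur.SHOULD);
            }
        }
        return builder.build();
    }
}
          </preformat>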
        </sec>
        <sec id="sec-3-5-3">
          <title>3.5.5. Reranking</title>
          <p>After the searching process has retrieved the highest-ranked documents with respect to the employed
criteria, we can apply a second phase of ranking. This phase is not going to look through all the
documents again; instead, it works on the documents the first phase retrieved. In our
case, we will take only the first k documents for efficiency reasons. The value of k will be discussed
later. The following diagram shows the process flow happening during the reranking phase.</p>
          <p>The reranking approach we use is based on machine learning, specifically on sentence embedding
models.</p>
          <p>Essentially, we employ a pre-trained model that maps text to a vector in a multidimensional space.
For each one of the topics we are processing, we compute its vector. Then we take the first k documents
retrieved by Lucene's searcher and compute their vectors. We then compute a score for each match
using the dot product. The resulting value is used to rerank the documents retrieved by the search by
increasing their score accordingly.</p>
          <p>For each one of the k documents, its updated score will be computed as follows:</p>
          <p>s_i ← s_i + c · sim(emb(q), emb(d_i))</p>
          <p>
            where c is a coefficient that is used to decide how much to value the output of the sentence embedder;
sim is the function that computes the similarity between the two vectors, and in our system we use the one
provided by the sentence_transformers Python package [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]; emb computes the embedding of a
piece of text, and here it is used to compute the vectors for the query q and for the i-th document d_i.
This process will be applied to all the queries.
          </p>
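          <p>In code, the update amounts to a single pass over the top k hits. The sketch below assumes the stored text field of Section 3.4; EmbeddingClient.similarityScore, sketched later in this subsection, stands in for sim(emb(q), emb(d_i)).</p>
          <preformat>
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import java.io.IOException;
import java.util.Arrays;

class Reranker {
    // Boosts the Lucene score of the top k hits by c times the embedding similarity.
    static void rerank(ScoreDoc[] hits, String topicText, IndexSearcher searcher,
                       int k, float c) throws IOException {
        for (int i = 0; i &lt; Math.min(k, hits.length); i++) {
            String docText = searcher.doc(hits[i].doc).get("text");
            hits[i].score += c * EmbeddingClient.similarityScore(topicText, docText);
        }
        // Restore decreasing-score order after the updates.
        Arrays.sort(hits, (a, b) -&gt; Float.compare(b.score, a.score));
    }
}
          </preformat>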
          <p>
            The model we used is all-mpnet-base-v2 and it is freely available from HuggingFace [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. It is
trained mostly on English data, from A Repository of Conversational Datasets by Henderson et al. [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ],
so we expect a greater increase in performance when using it for the translated documents. As we will
see, this is the case.
          </p>
          <p>
            The architecture of the model is based on Microsoft's MPNet, which stands for Masked and Permuted
Pre-training [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. MPNet combines the best parts of BERT and XLNet. It uses a special training scheme
[
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] that mixes up the order of words to learn their connections while still understanding the context from
both directions, like BERT does. This hybrid approach allows MPNet to learn contextual representations
more effectively, resulting in improved performance on various natural language understanding tasks.
The model is formed by 12 transformer layers, each of them having 768 hidden units and
12 attention heads. Given its training on extensive English datasets from Henderson et al.'s repository,
all-mpnet-base-v2 excels in tasks involving English text and, as demonstrated, it shows significant
improvements when applied to the translated documents. As we will see in Section 5, it will allow us to
improve our information retrieval system on the original French documents as well.
          </p>
          <p>
            As a similarity function we used the dot_score function from the sentence_transformers
package. This function computes the dot product between the two embedding vectors
[
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
          </p>
          <p>Figure 4 illustrates the reranking algorithm that has been implemented, detailing the number of
documents after each phase. To clarify, N represents the number of top-ranked documents retrieved in
the initial search phase, while k has already been introduced. As can be seen from the diagram, only the
scores of the first k documents are updated, while the remaining N − k documents are not reranked.</p>
          <p>
            As we previously said, to interact with the model we used the sentence-transformers library in
Python, and in order to interact with Python from the Java code, we use a Flask HTTP server [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. We
then use HTTP requests from the Java code to get the score of a document with respect to a query.
          </p>
          <p>Although this adds a little overhead, it was much simpler than interacting with the sentence embedder
directly from Java.</p>
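          <p>A sketch of the Java side of this bridge. The /score endpoint, the port and the JSON shape are our illustrative assumptions about the Flask service, which is assumed to reply with the dot_score value as plain text.</p>
          <preformat>
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;

class EmbeddingClient {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Posts the topic and document text to the local Flask wrapper around
    // sentence-transformers and parses the returned similarity score.
    static float similarityScore(String query, String document) throws IOException {
        try {
            String body = MAPPER.writeValueAsString(Map.of("query", query, "document", document));
            HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:5000/score"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse&lt;String&gt; response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            return Float.parseFloat(response.body().trim());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException(e);
        }
    }
}
          </preformat>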
          <p>All the experiments concerning the reranker are available in Section 5.4.</p>
        </sec>
        <sec id="sec-3-5-4">
          <title>3.5.6. N-gram word model</title>
          <p>
            This section briefly explains another technique for improving SE results, called the N-gram word model.
The idea behind it is that phrases are particularly significant in the IR field. Specifically, statistics show
that most two- or three-word queries are phrases. Because of this, it is important to consider multiple
words as phrases rather than as independent words. However, the impact of using phrases can be complex,
so it is essential to be careful when using them. Let us consider the following definition of the word
phrase: a phrase is any sequence of n words, also known as an n-gram. Sequences of two words are called
bigrams, while sequences of three words are called trigrams. The greater the frequency of occurrence of
a word n-gram, the higher the probability that it corresponds to a meaningful phrase in the language.
In the context of IR, n-grams are used to index and retrieve documents based on user queries [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ].
          </p>
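          <p>In Lucene, this idea maps onto a ShingleFilter appended to the analysis chain. A minimal sketch producing the word bigrams and trigrams used in our Shingles runs:</p>
          <preformat>
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

class ShingleAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // Emit n-grams of 2 and 3 words; single tokens are kept by default.
        stream = new ShingleFilter(stream, 2, 3);
        return new TokenStreamComponents(source, stream);
    }
}
          </preformat>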
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>The setup used to run the experiments is the following:
• The system has been run on the collections available from clef-longeval.github.io/data/. All the
evaluation has been performed on the 2024 Training Set;
• To evaluate the performance of the SE we used trec_eval 9.0.7, available at trec.nist.gov/trec_eval/;
• The repository with the code can be found at bitbucket.org/upd-dei-stud-prj/seupd2324-dam/.
We ran most of the experiments on the following hardware:</p>
      <p>
        For reproducibility reasons, the runs were also submitted to TIRA [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Since some of the experiments utilize the reranker, the container is given access to a GTX 1080 video card.
      </p>
      <sec id="sec-4-1">
        <title>4.1. TIRA</title>
        <p>TIRA is a key platform for research in information retrieval, designed to facilitate blinded and
reproducible experiments. Generally, studies in this field suffer from a lack of reproducibility, since typically
only test collections and research papers are shared, requiring third parties to rebuild software to test
new datasets. To address this, TIRA has been upgraded to ease task setup and software submission,
scaling efficiently from local setups to cloud-based systems using parallel CPU and GPU processing.
Overall, TIRA enhances the conduct of AI experiments, ensuring both secrecy and repeatability, and
improving the reliability and progression of research in information retrieval.</p>
      <p>
        In particular, for Information Retrieval, TIREx (the Information Retrieval Experiment Platform) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] has
been developed. It integrates ir_datasets, ir_measures, and PyTerrier [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] with TIRA to promote more
standardized, reproducible, scalable, and even blinded retrieval experiments.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>Before exploring the results our system is able to obtain, it is useful to list all the possible components
we can make use of and assign each of them a keyword. These keywords will be used in the names of the
runs to identify them.</p>
      <p>• FR: if the indexing and searching process was done on the French version of the documents.
• EN: if the indexing and searching process was done on the English version of the documents.
• Snowball: if the Snowball stemmer was used.
• Krovetz: if the Krovetz stemmer was used.
• FrenchLight: if the FrenchLight stemmer was used.
• Poss: if the English possessive filter was used.
• Elision: if the ElisionFilter was used.
• Stop: if a list of stopwords was used. For the English documents we used a default one provided
by Lucene. For the French documents we used a custom list that is available in our repository. It
is important to note that for the French documents we used both lists, because those documents
contain some paragraphs in English.
• ICU: if ICU folding was used.
• Prox: if proximity search was used. If this keyword is present, we will also write the distance
parameter, as explained in Section 3.5.3. For example, if proximity search at a distance of 50
is employed, we will write Prox(50).
• Reranking: if reranking was used. This also comes with a parameter: the k value we discussed
in Section 3.5.5. The c parameter will instead take a fixed value of 5. As an example, if the system
reranks the first 50 documents, we will write Reranking(50).
• Syns: if the query expansion technique using synonyms was used.</p>
      <p>• Shingles: if word N-grams are being used. In our case we generate n-grams of length 2 and 3.</p>
      <p>All runs also employ a filter that discards all tokens shorter than 2 characters or longer than 20.
Other than that, by default, all the tokens are transformed into lowercase. All the experiments have
been run using BM25 with default parameters as the ranking function.</p>
      <p>
        The two main metrics we are going to use to compare runs are the Normalized Discounted Cumulative
Gain (nDCG) and the Mean Average Precision (MAP). The reason why we chose these metrics is that
they are widely used in IR tasks, since they offer a comprehensive understanding of the effectiveness
of retrieval systems. Additionally, they are simple and easy to understand and interpret. nDCG helps
measure how close the system's output is when compared to an ideal run, where all the items are
sorted in decreasing order of relevance. Moreover, since nDCG is a normalized metric, it enables fair
comparisons between different lists of varying lengths and relevance distributions [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. MAP, on the
other hand, calculates the average precision (AP) across all relevant documents, giving insight into the
system's overall precision and recall. MAP decreases more rapidly if there are non-relevant items at
the top [21]. By calculating both of the previously mentioned metrics, we can assess the system's
performance from different perspectives.
      </p>
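      <p>As a concrete reference, the following compact sketch mirrors the nDCG computation for a single ranked list (simplified with respect to trec_eval: the ideal ranking here is built from the retrieved gains only).</p>
      <preformat>
import java.util.Arrays;

class Metrics {
    // gains[i] is the graded relevance of the document returned at rank i + 1.
    static double ndcg(int[] gains) {
        int[] ideal = gains.clone();
        Arrays.sort(ideal); // ascending; read backwards for the ideal ordering
        double dcg = 0.0, idcg = 0.0;
        for (int i = 0; i &lt; gains.length; i++) {
            double discount = Math.log(i + 2) / Math.log(2); // log2(rank + 1)
            dcg += gains[i] / discount;
            idcg += ideal[ideal.length - 1 - i] / discount;
        }
        return idcg == 0 ? 0.0 : dcg / idcg;
    }
}
      </preformat>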
      <sec id="sec-5-1">
        <title>5.1. Baseline Lucene performance</title>
        <p>To set a baseline for our next runs, we ran a basic Lucene configuration on both the French and
the English documents. This baseline configuration just uses the standard tokenizer, the Snowball
stemmer, the lowercase filter, and BM25 as the retrieval model.</p>
        <p>The results we obtained are shown in Table 1.</p>
        <p>As we can see, the performance on the “original” French dataset is better than on the translated
English dataset: nDCG and MAP are 23.96% and 38.59% better, respectively.</p>
        <p>This could be explained by the fact that the automated translation is not very accurate and some
information is lost in the process.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Choice of stemmer</title>
        <p>As described in Section 3, there are several stemmers that can be used to derive word roots. We
made some runs to decide the best one for English and French, respectively. For both collections, the
corresponding stoplists were used to conduct the experiments.</p>
        <p>For French, the FrenchLight stemmer, compared with SnowBall, improves the performance of our
system, increasing nDCG and MAP by 1.47% and 2.06%, respectively.</p>
        <p>For English, the SnowBall stemmer improves the performance of our system compared to Krovetz:
nDCG and MAP increase by 1.86% and 3.18%, respectively.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Synonyms</title>
        <p>In this section we are going to talk about synonym query expansion. We tried this approach both for
English and French. The experimental setup was:
• English stoplist, possessive filter and SnowBall stemmer for the English collection.</p>
        <p>• French stoplist, elision filter, ICU filter and FrenchLight stemmer for the French collection.</p>
        <p>The synonyms included in the queries are analyzed with the same analyzer used for indexing the
documents and, furthermore, are assigned a weight of 0.7 to reduce their significance compared to the
words originally present in the queries. The results are reported in Tables 4 and 5.</p>
        <p>As we can see, in both cases the synonym query expansion worsened the performance of our
information retrieval system. One of the reasons could be that synonyms that are not related to the
queries are still included in them because, perhaps, they share the same root. So, in the following
experiments we completely discard synonyms and focus on other methods to improve the
effectiveness of our system.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Choice of the reranker parameter</title>
        <p>Since, as covered in Section 3.5.5, we need to decide a threshold k for the number of documents to
undergo reranking, we made some runs to find the best value.</p>
        <p>We will use the reranker for the experiments on both the English and the French documents,
so we tried it on both collections. In addition to the techniques used in the previous experiment, we
also added a stopword list.</p>
        <p>For the French collection, the stopword list, ICU folding and the elision filter have been used as well.
The runs on the English collection made use of the English possessive filter.</p>
        <p>As we can see, the reranker allows us to improve the performance of the retrieval system. Between the
run with no reranking (k = 0) and the one with k = 200, we notice an increase in nDCG of 2.97%.</p>
        <p>The best value of MAP occurs instead for k = 150, with an increase of 9.25%.</p>
        <p>As far as the English collection is concerned, the best result is obtained with k = 200, with increases
in nDCG and MAP of 7.03% and 17.66%, respectively.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Training results</title>
        <p>In this section we are going to compare the best configurations for various parameters for both the
French and the English language. These configurations represent the runs we have chosen to submit to
LongEval (they are listed, with their system numbers, below):
• The best system for the English language that doesn't use reranking;
• The best system for the English language, using reranking and the SnowBall stemmer;
• The best system for the French language that doesn't use reranking;
• The best overall system. This is the system that achieved the best results using reranking and
the other parameters discussed in the previous sections;
• A system using word N-grams.</p>
        <p>The results of these systems are displayed in Table 8. The submitted runs are:
• System 1: EN-Stop-SnowBall-Poss-Prox(50)
• System 2: EN-Stop-SnowBall-Poss-Prox(50)-Reranking(200)
• System 3: FR-Stop-FrenchLight-Elision-ICU-Prox(50)
• System 4: FR-Stop-FrenchLight-Elision-ICU-Prox(50)-Reranking(150)
• System 5: FR-Stop-FrenchLight-Elision-ICU-Shingles-Prox(50)-Reranking(150)</p>
        <p>A first notable observation is that our search engine generally performs better on French documents.
As we previously noted, this trend may be attributed to the fact that the original corpus was written in
French and later translated into English.</p>
        <p>A further analysis of the results reveals that reranking leads to the best performance for both languages.
However, the introduction of the N-grams (Shingles) technique significantly decreases the search engine's
performance, nullifying the benefits of reranking and resulting in an even worse outcome compared to
the best-performing configuration without this approach.</p>
        <p>Figure 5 shows the interpolated precision-recall curves of the systems described in Table 8. This
type of plot is useful to represent the trade-off between two different measures like precision and recall. In
this way we can compare different systems to figure out which one is better than the others. The System 5
curve intersects the System 3 curve at recall 0.2; hence, System 5 performs better at the high ranks and worse
at the low ranks. Regarding the other systems, the curves never intersect, so their performances are more
clearly separated. In accordance with the results reported in Table 8, System 4 is the best.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Test results</title>
        <p>To assess how well the submitted runs perform, we analyzed the nDCG and MAP measures, and also
performed Statistical Hypothesis Testing (SHT). The analysis was done for each of the five submitted
runs, for both Short-term and Long-term collections. Short-term represents the collection from June
2023, while Long-term represents the collection from August 2023. This kind of analysis is essential as
it shows how well each IR system performs. Moreover, it is particularly significant to compare these
results since the purpose of the LongEval task is to find an IR system that can handle changes over time,
meaning that we are interested in systems that have low measure drops over time. Table 9 shows the
performance drop of the systems between June and August.</p>
        <p>Additionally, SHT was performed to determine whether the systems' performances are statistically
different or not. This is important to understand whether some configurations improve the system or not,
for example, whether the usage of reranking has a real effect on the performance or the mean increase is
due only to the variance of the test.</p>
        <p>In this section, we will further explain SHT, present the results for the previously mentioned metrics
as well as for SHT, and provide our conclusions about them.</p>
        <sec id="sec-5-6-1">
          <title>5.6.1. Statistical Hypothesis Testing</title>
          <p>SHT is a type of statistical analysis used to estimate the relationship between statistical variables. Later
in this section, the results produced by performing SHT on both the Short-term and Long-term datasets
will be shown and explained. SHT is important for determining in a scientifically valid way whether
our systems are performing similarly or differently. In other words, we are interested in knowing if
there is a statistically significant difference between them.</p>
          <p>In order to perform SHT, two mutually exclusive hypotheses H0 and H1 must be defined. H0 is
called the null hypothesis, while H1 is the alternative hypothesis. Besides these hypotheses, a threshold α
must be defined, representing a significance level. For example, α = 0.05 means there is a 5% probability
of wrongly declaring that the systems are different. SHT uses sample data to determine if H0 can be
rejected. If that is the case, it means that the alternative hypothesis H1 is true [22].</p>
          <p>In particular, we will use Two-Way Analysis of Variance (ANOVA2) as the statistical test. It examines
the influence of two different variables, which are, in our case, the systems and the topics. ANOVA2 is
used to evaluate the difference between the means of more than two groups, which is useful in our case
since we have 5 different IR systems. In ANOVA2 the hypotheses are as follows:
• H0 - the means of all groups are equal,
• H1 - at least 2 groups have different means [23].</p>
        </sec>
        <sec id="sec-5-6-2">
          <title>5.6.2. Short-term</title>
          <p>In this section, results for the nDCG and AP measures, as well as for SHT, are presented for the
Short-term collection. Additionally, conclusions for the given results are provided.</p>
          <p>Table 9 displays nDCG and MAP measures of the five submitted systems for the Short-term collection.
It is evident that the best-performing system for the training data, based on both measures, remains
consistent: FR-Stop-FrenchLight-Elision-ICU-Prox(50)-Reranking(150). However, there is a notable
performance drop of 21.12% for nDCG and 19.16% for MAP. This drop was expected, as user queries
and preferences have evolved over time. Notably, the systems with English configurations experience
the largest drops. This phenomenon could be related to the automated translation process from French
to English.</p>
          <p>Based on the box plots in Figures 6 and 7, it is notable that the FR systems tend to have higher
median values for both measures compared to the EN systems, indicating generally better performance.
Additionally, the FR systems do not have as many outliers as the EN systems, which shows a more
consistent performance.</p>
          <p>Table 10 and Table 11 show the results of the ANOVA2 test.</p>
          <p>The SS column shows the total variability for each source. Higher values indicate greater variability. df
indicates how many degrees of freedom a source of variation has. For example, since the matrix of the
scores contains a system for each column, and 5 systems are being compared, the columns will have 4
degrees of freedom.</p>
          <p>The MS column is the average of the sum of squares (SS) for each source, calculated by dividing SS
by its corresponding degrees of freedom (df). It represents the variance for each source.</p>
          <p>F shows the F-statistic, calculated as the ratio of the mean squares of the source to the mean squares
of the error. It is used to determine if the observed variance between groups is significantly greater
than the variance within groups.</p>
          <p>Finally, the column we are most interested in, Prob&gt;F, indicates the p-value associated with the
F-statistic. A lower p-value (in our case less than 0.05) suggests that the differences between group
means are statistically significant.</p>
          <p>In this case, both Table 10 and Table 11 allow us to state that the 5 systems have different performances,
with a probability of being wrong very close to 0.</p>
          <p>After discovering that the 5 systems are different, it is useful to compare the systems pairwise in order
to understand where the difference comes from. For this purpose, we will employ Tukey's Honest
Significant Difference (HSD) test.</p>
        </sec>
        <sec id="sec-5-6-3">
          <title>5.6.3. Long-term</title>
          <p>In this section, results for the nDCG and AP measures, as well as for SHT, are presented for the
Long-term collection. Additionally, conclusions for the given results are provided.</p>
          <p>Table 9 shows the nDCG and AP scores of the five systems submitted for the Long-term collection.
It's clear that the best-performing system, based on both measures, remains consistent:
FR-Stop-FrenchLight-Elision-ICU-Prox(50)-Reranking(150). However, there's a significant performance
decline of 41.45% for nDCG and 44.47% for MAP between the training and Long-term data. This decline
was expected, given the evolution of user queries and preferences over time.</p>
          <p>Based on the box plots in Figures 9 and 10, we can draw conclusions similar to those we made when
comparing the box plots for the Short-term collection - the FR systems showed better performance for both
measures, compared to the EN systems. Moreover, the FR systems have fewer outliers than the EN systems.
When comparing the box plots 9 and 10 for the Long-term collection with the box plots 6 and 7 for the
Short-term collection, we can conclude the following: all five IR systems show a slight decrease in the
median nDCG and AP values over time, which was expected. Another notable observation is that the
box plots for the Long-term collection have more outliers than those for the Short-term collection. This
behavior could be explained by various factors, such as the impact of different queries and documents
and changes in user behavior over time.</p>
          <p>In the same fashion as what was done for the Short-term dataset in Section 5.6.2, we ran the ANOVA2
test and the pairwise comparison using the HSD test. Table 14 and Table 15 show the results of the
ANOVA2 test. The output of the systems comparison is reported in Table 16 and Table 17 for AP and
nDCG, respectively. The p-values lower than α = 0.05 are shown in bold, meaning that we reject
the null hypothesis. One notable observation is that, on these datasets, all the systems, except for
System 3 and System 5, are statistically different.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>In this section we summarize the main achievements and the conclusions we reached during the
development of the SE. Firstly, using a basic configuration (with the standard tokenizer, the Snowball
stemmer, and the lowercase filter), we observed better performance on the French dataset compared
to the English dataset. We could explain this behavior with the fact that the translation to English is
automated.</p>
      <p>Secondly, the default performance shown in Section 5 is quite good, but after experimenting with
different configurations, we managed to find better parameters. It is important to mention that we
always used the appropriate stopword lists, as we noticed an improvement in performance when utilizing
them. Moreover, the SE performed better when using an appropriate stemmer, while the usage of
synonyms had detrimental effects on it. Another technique that brought significant progress was the
usage of a reranker for both languages.</p>
      <p>While analyzing performance drops and SHT results for both the Short-term and Long-term collections,
we arrived at the following conclusions. There is a significant difference between the systems with
French and English configurations in terms of performance. The FR systems consistently outperform
the EN systems in both datasets. As already mentioned, this is related to the automated translation from
French to English and the fact that the language evolves over time [24]. This demonstrates that
language-specific optimizations play a vital role in the effectiveness of retrieval systems. Moreover, systems with
reranking generally perform better than non-reranking systems. Furthermore, the performance drop
over time is evident, and it highlights the need for continuous updates to maintain performance over
time.</p>
      <p>Regarding future work, we could try to improve query expansion with the use of Large Language
Models (LLMs), which have shown remarkable capabilities in the IR field, particularly in text
understanding [25]. Indeed, in order to better perceive the user's intent and create a more efficient query,
we could reformulate it. For this purpose, we could use a language model to rephrase the query and
hopefully increase the performance of the SE. These models can grasp the context and meaning of text,
enabling more accurate retrieval of relevant documents [26]. Another idea that could be beneficial to
the performance of our system would be to use a sentence embedder model trained specifically on
French data. This could hypothetically increase the boost in performance obtained with the use of the
reranker even further.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] S. Chavhan, M. Raghuwanshi, R. Dharmik, Information Retrieval using Machine Learning for Ranking: A Review, https://iopscience.iop.org/article/10.1088/1742-6596/1913/1/012150/meta, 2021. [Online; accessed: 2024-05-22].
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] R. Nogueira, W. Yang, K. Cho, J. Lin, Multi-stage document ranking with BERT, 2019. URL: https://arxiv.org/abs/1910.14424. arXiv:1910.14424.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] E. Bolzonello, C. Marchiori, D. Moschetta, R. Trevisiol, F. Zanini, N. Ferro, et al., SEUPD@CLEF: Team FADERIC on a query expansion and reranking approach for the LongEval task, in: CEUR Workshop Proceedings, volume 3497, CEUR-WS, 2023, pp. 2252-2280.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] Apache Lucene Website, https://lucene.apache.org/, 2024. [Online; accessed: 2024-05-30].
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Jackson GitHub Page, https://github.com/FasterXML/jackson, 2024. [Online; accessed: 2024-06-01].
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: BM25 and beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333-389.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] LongEval 2024 Test Collection, https://doi.org/10.48436/xr350-79683, 2024. [Online; accessed: 2024-06-01].
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] Lucene Query Parser Documentation, https://lucene.apache.org/core/9_10_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html, 2024. [Online; accessed: 2024-06-02].
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press, Cambridge, MA, 1998.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] G. A. Miller, WordNet: A lexical database for English, Communications of the ACM 38 (1995) 39-41.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, https://arxiv.org/abs/1908.10084, 2019. [Online; accessed: 2024-05-05].
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] all-mpnet-base-v2 Documentation, https://huggingface.co/sentence-transformers/all-mpnet-base-v2, 2024. [Online; accessed: 2024-06-05].
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] M. Henderson, P. Budzianowski, I. Casanueva, S. Coope, D. Gerz, G. Kumar, N. Mrkšić, G. Spithourakis, P.-H. Su, I. Vulic, T.-H. Wen, A repository of conversational datasets, in: Proceedings of the Workshop on NLP for Conversational AI, 2019. URL: https://arxiv.org/abs/1904.06472, data available at github.com/PolyAI-LDN/conversational-datasets.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] K. Song, X. Tan, T. Qin, J. Lu, T.-Y. Liu, MPNet: Masked and Permuted Pre-training for Language Understanding, https://arxiv.org/abs/2004.09297, 2020. [Online; accessed: 2024-05-22].
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] Flask Documentation, https://flask.palletsprojects.com/en/3.0.x/, 2024. [Online; accessed: 2024-06-01].
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] W. B. Croft, D. Metzler, T. Strohman, Search Engines - Information Retrieval in Practice, https://ciir.cs.umass.edu/irbook/, 2015. [Online; accessed: 2024-05-01].
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236-241. doi:10.1007/978-3-031-28241-6_20.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] M. Fröbe, J. Reimer, S. MacAvaney, N. Deckers, S. Reich, J. Bevendorff, B. Stein, M. Hagen, M. Potthast, The Information Retrieval Experiment Platform, in: H. Chen, W. E. Duh, H. Huang, M. P. Kato, J. Mothe, B. Poblete (Eds.), 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), ACM, 2023, pp. 2826-2836. doi:10.1145/3539618.3591888.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] C. Macdonald, N. Tonellotto, S. MacAvaney, I. Ounis, PyTerrier: Declarative experimentation in Python from BM25 to dense retrieval, in: Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM 2021), 2021, pp. 4526-4533.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] A. Dhinakaran, Demystifying nDCG, https://towardsdatascience.com/demystifying-ndcg-bee3be58cfe0, 2023. [Online; accessed: 2024-05-02].
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] R. J. Tan, Breaking Down Mean Average Precision (mAP), https://towardsdatascience.com/breaking-down-mean-average-precision-map-ae462f623a52, 2019. [Online; accessed: 2024-05-02].</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] C. Majaski, Hypothesis Testing: 4 Steps and Example, https://www.investopedia.com/terms/h/hypothesistesting.asp, 2024. [Online; accessed: 2024-05-15].</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] W. Kenton, What Is Analysis of Variance (ANOVA)?, https://www.investopedia.com/terms/a/anova.asp, 2024. [Online; accessed: 2024-05-15].</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] R. Alkhalifa, E. Kochkina, A. Zubiaga, Building for tomorrow: Assessing the temporal persistence of text classifiers, https://arxiv.org/abs/2205.05435, 2022. [Online; accessed: 2024-05-22].</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] Y. Zhu, H. Yuan, S. Wang, J. Liu, W. Liu, C. Deng, H. Chen, Z. Dou, J.-R. Wen, Large Language Models for Information Retrieval: A Survey, https://arxiv.org/pdf/2308.07107, 2024. [Online; accessed: 2024-05-04].</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] V. Gupta, A. Dixit, S. Sethi, An Improved Sentence Embeddings based Information Retrieval Technique using Query Reformulation, https://ieeexplore.ieee.org/document/10141788, 2023. [Online; accessed: 2024-05-04].</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>