<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Team 3DS2A at LongEval: Performance Evaluation over Time of IR Systems with Proximity Search and Reranking Components</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Bruttomesso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Cavazza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Corrò</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Peraro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Seghetto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Padua</institution>
          ,
          <addr-line>Via 8 Febbraio, 2 - 35122 Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This report gives an overview of the system developed by Team 3DS2A for Task 1 of the LongEval Lab at CLEF 2025. The team members are students enrolled in the Computer Engineering master's program at the University of Padua. The team developed an information retrieval system tailored to run on a French-language document corpus. The system was evaluated over a nine-month timespan to assess its robustness in terms of precision and recall over time. Throughout development, multiple techniques were explored, including alternative text analyzers, proximity search, chunk-based indexing, and semantic reranking using sentence embeddings. This report presents the system architecture, the experimental strategies adopted, and the performance achieved during the training phase.</p>
      </abstract>
      <kwd-group>
<kwd>CLEF</kwd>
        <kwd>LongEval 2025</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Search Engine</kwd>
        <kwd>Reranking</kwd>
        <kwd>Word Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In today’s digital age, online searching has become an integral part of daily life, with billions of users
worldwide relying on search engines to quickly and accurately find the information they need. A search
engine (SE) is a software system designed to fulfill this demand by processing user queries that express
specific information needs. However, as the number of web pages continues to grow rapidly, search
engines face increasing challenges, one of the most significant being a decline in performance over time.
To address this issue, the LongEval Lab, organized by the Conference and Labs of the Evaluation Forum
(CLEF), encourages the development of temporal information retrieval (IR) systems capable of adapting
to the evolving nature of online content [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This work reports on the proposed solution implemented
by the 3DS2A group at the University of Padua as part of the Search Engines Course.
      </p>
      <p>Alongside the traditional search pipeline, the system also employs more sophisticated techniques such as
chunk-based search, pseudo-relevance feedback, and reranking based on a sentence embedding model
to enhance the retrieval effectiveness of the submitted queries.</p>
      <p>The paper is organized as follows: Section 2 briefly introduces related work from past LongEval
tasks at CLEF 2024; Section 3 describes our approach; Section 4 explains our experimental setup; Section
5 discusses our main findings; Section 6 presents the statistical analysis performed to assess the
significance of performance differences between systems; finally, Section 7 draws some conclusions
and outlines future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The LongEval 2025 task follows the same general structure as previous CLEF LongEval editions, with
the key difference that this year’s training set spans a significantly longer temporal window.</p>
      <p>
        In past editions, a wide range of information retrieval techniques have been explored, with sentence
embedding-based approaches playing a prominent role, especially in the re-ranking phase [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For
example, Basaglia et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] adopted a reranking strategy based on sentence embeddings, demonstrating
its effectiveness in improving retrieval performance on temporally dynamic datasets.
      </p>
      <p>
        While the discussion about whether stemming or lemmatizing achieves better performance is still
open [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], another common technique involves the expansion of query terms [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and the use of
Named Entity Recognition [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>Our objective is to explore some of these techniques, combining them into new approaches while
participating in the LongEval Web retrieval task.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        This section describes the methodologies adopted by our IR system for this task. To build our IR system,
we used Apache Lucene 10.1.0 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], an open source search engine library developed in Java, known
for its high performance and rich feature set. Lucene provides a robust framework for indexing and
retrieval, which we extended with a reranking module based on sentence embeddings.
      </p>
      <p>
        Our goal was to enhance the effectiveness of an approach explored in the previous year by Basaglia
et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We used a similar approach based on reranking a small chunk of documents, but using a
larger and more expressive sentence embedding model to capture deeper semantic similarities between
queries and documents.
      </p>
      <p>The main components of our system are:
• WebDocumentParser to read and parse a .json file into a WebDocument object.
• WebAnalyzer to analyze the contents of a document and return tokens ready to be indexed.
• DirectoryIndexer to manage the opening, parsing and indexing of the contents in a specific
directory.
• Searcher to load the provided queries and perform the matching between those and the indexed
documents.</p>
      <sec id="sec-3-1">
        <title>3.1. Parsing</title>
        <p>The first step in an IR system is collecting the documents and providing them to the system in a
"cleaned" and organized way. For this purpose, two classes have been developed in our system: the
WebDocument and WebDocumentParser classes, which are responsible for reading the documents
and abstracting them into organized Java objects. We developed our parser class to read the
.json files using the Jackson parser and convert them into the specific Java class WebDocument.</p>
        <p>The structure of the .json files is rather simple, and it is made as an array of anonymous elements
with only two fields: id and contents, therefore the associated Java class will have the same structure.
The most important thing about the parsing process is that it also provides a pre-processing step to
clean the contents of the document. This step does not aim to perform detailed text analysis, as that
responsibility is delegated to the analyzer component; we therefore focus on removing HTML
tags, unwanted punctuation, and URLs, and on normalizing consecutive whitespace.</p>
        <p>All of this happens while accessing the list of documents in the .json file, so each document is processed
only when it is actually accessed during iteration of the collection, rather than processing
the whole collection at once. The WebDocumentParser provides access to the documents
contained in a single file through the canonical iteration methods and returns the WebDocument
object ready to be analyzed by the consumer WebAnalyzer.</p>
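        <p>For illustration, the following is a hedged sketch of this kind of lazy, stream-based parsing with Jackson; the class and field names mirror the description above but are illustrative rather than the exact project code.</p>
        <preformat>
// Hypothetical sketch of lazily parsing a LongEval .json file with Jackson.
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.File;
import java.io.IOException;

public class WebDocumentParserSketch {

    // Minimal document abstraction: the .json objects only expose "id" and "contents".
    public record WebDocument(String id, String contents) { }

    public static void parse(File jsonFile) throws IOException {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(jsonFile)) {
            // The file is a JSON array of anonymous objects.
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IOException("Expected a JSON array");
            }
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                String id = null;
                String contents = null;
                while (parser.nextToken() != JsonToken.END_OBJECT) {
                    String field = parser.currentName();
                    parser.nextToken(); // move to the field value
                    if ("id".equals(field)) {
                        id = parser.getText();
                    } else if ("contents".equals(field)) {
                        contents = clean(parser.getText());
                    }
                }
                WebDocument doc = new WebDocument(id, contents);
                // hand doc over to the consumer (e.g. the WebAnalyzer / indexer) here
            }
        }
    }

    // Light pre-processing: strip HTML tags and URLs, normalize consecutive whitespace.
    private static String clean(String text) {
        return text.replaceAll("&lt;[^&gt;]+&gt;", " ")
                   .replaceAll("https?://\\S+", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }
}
        </preformat>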
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Analyzing</title>
        <p>Once we have access to the contents of our document, we must extract a sequence of textual elements,
called tokens, to be provided to the indexer. These tokens represent the basic units upon which the
search process operates. The main purpose of the analyzer component is to provide those tokens and
apply some filtering and optimizations to improve the performance of our system.</p>
        <p>To perform these operations, we developed our implementation of analyzer, called WebAnalyzer,
which provides access to tokens through the createComponents method. The pipeline to generate
a token is divided into several steps. First of all, we want to generate tokens from the text using
one of the provided Lucene classes, like the WhitespaceTokenizer, the LetterTokenizer or the
StandardTokenizer. Using one of these allows us to split the text into words or other components,
which we can subsequently analyze. After applying a LowerCaseFilter component to the tokens, to
normalize the capitalization of the text, we remove the tokens that may not be relevant for
our purpose: we use a LengthFilter to discard tokens that are too short or too long (with length
thresholds of 3 and 100 characters), and finally a StopFilter component to eliminate the stopwords. To
make the text more uniform, we also apply ElisionFilter and ASCIIFoldingFilter components, just
before filtering the stopwords, to remove elisions and normalize uncommon characters that may be
present in the text.</p>
        <p>The last and probably most important step in our analysis pipeline is the stemming process.
We want our system to be able to match terms that might not be written in exactly the same way in the
documents and queries (e.g., plural/singular nouns). Therefore, we apply a stemmer component
at the end of the pipeline. Since stemming is a very delicate part of the analysis, we tried
different configurations of the stemmer component, such as the Krovetz, Snowball and Porter stemmers,
but also Lucene’s FrenchLightStemFilter and FrenchMinimalStemFilter, since our collection is made of French
queries and documents. At the end of the process, the WebAnalyzer generates a sequence of tokens
ready to be consumed by our indexer.</p>
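        <p>As an illustration, the chain described above can be sketched as follows; this is a hedged example built from Lucene's standard analysis components (package names as in recent Lucene releases), not the exact WebAnalyzer source.</p>
        <preformat>
// Illustrative sketch of the WebAnalyzer pipeline for French text.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.fr.FrenchLightStemFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.ElisionFilter;

public class WebAnalyzerSketch extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();                // split text into candidate tokens
        TokenStream stream = new LowerCaseFilter(source);          // normalize capitalization
        stream = new LengthFilter(stream, 3, 100);                 // drop too short / too long tokens
        stream = new ElisionFilter(stream, FrenchAnalyzer.DEFAULT_ARTICLES); // remove l', d', qu', ...
        stream = new ASCIIFoldingFilter(stream);                   // fold accented / uncommon characters
        stream = new StopFilter(stream, FrenchAnalyzer.getDefaultStopSet()); // remove French stopwords
        stream = new FrenchLightStemFilter(stream);                // light French stemming
        return new TokenStreamComponents(source, stream);
    }
}
        </preformat>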
        <p>To ensure flexibility and allow for easy experimentation with different analysis strategies, the
WebAnalyzer component is configurable using an external XML file. This file specifies the parameters
used in the analysis pipeline, such as the tokenizer type, minimum and maximum token length, the
stemmer to be applied, and the activation of specific filters such as ASCIIFoldingFilter or ElisionFilter.
The configuration is defined using an AnalyzerParams class, which is populated from the XML using
Jackson annotations.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. POS Tagging and Lemmatization</title>
          <p>
            In order to improve the matching performance of our system, we tried to implement different versions
of the analysis component. One of the first ideas was to take advantage of the OpenNLP [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] project to
reduce the number of indexed terms and improve search results by expanding some types of words.
To this end, we implemented an OpenNLPAnalyzer class
using OpenNLP models for sentence detection, tokenization, POS tagging, and lemmatization. We also
developed a POSTagFilter TokenStream component to discard unwanted token types. However,
when performing the analysis of the documents, we could not achieve a sufficiently fast execution
time, and therefore we dropped this implementation.
          </p>
          <p>
            A second attempt at lemmatization was made using a simpler model based on Lefff [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ][
            <xref ref-type="bibr" rid="ref12">12</xref>
            ],
a large-scale morphological and syntactic lexicon for French. In this case, we statically generated a
copy of the collection documents containing the lemmatized terms, and then we used this collection as
input to our system, removing the stemming process from the analysis pipeline. In this case, the
running time was very efficient, but we soon discovered that the lemmatization process did not improve
the matching performance of our system. Therefore, we also dropped this lemmatization approach.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Indexing</title>
        <p>Upon receiving the tokens generated from the analyzer’s pipeline, the system must store the terms in
the index. Lucene stores an inverted index, associating each term with a list of postings, that is, the list
of ids of the documents containing that term. To improve the search capabilities of
our system, we configured the index to also store the term frequencies and the relative positions, so that we
could take advantage of tf-idf statistics for each term and of phrase search. We also implemented a
more complex indexing procedure, where each document is split into chunks, in order to improve the
precision during the search phase, as better described in Section 3.3.1.</p>
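        <p>A minimal sketch of this index configuration is shown below: the body field keeps term frequencies and positions, which are required for tf-idf style scoring and for the phrase and proximity queries used later. Field and class names are illustrative.</p>
        <preformat>
// Sketch of a Lucene document whose body field stores frequencies and positions.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexOptions;

public final class WebDocumentFields {

    public static Document toLuceneDocument(String id, String contents) {
        FieldType bodyType = new FieldType();
        bodyType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS); // freqs + positions
        bodyType.setTokenized(true);   // run the analyzer pipeline on this field
        bodyType.setStored(false);     // the raw text is not needed at search time
        bodyType.freeze();

        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES)); // document identifier
        doc.add(new Field("contents", contents, bodyType));  // analyzed body field
        return doc;
    }
}
        </preformat>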
        <p>To manage the indexing process, we developed the DirectoryIndexer class, which allows us to parse
and index an entire directory containing .json files, especially useful since in our case we are dealing with
a sequence of "snapshot" directories, one for each month. Moreover, an efficient MultiThreadIndexer
class has been developed, so that multiple .json files are indexed at the same time.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Chunk Indexing</title>
          <p>
            In order to enhance the retrieval performance of our system, we also tried an alternative approach
at index-time. Chunk indexing is a strategy designed to enhance retrieval granularity by dividing
documents into semantically coherent text segments, called chunks, before indexing. The effectiveness
of chunk-based approaches has been highlighted in both neural and classical IR frameworks; in particular,
Yin and Schütze [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] operated on multiple levels of text granularity to improve matching performance
across chunks. Their findings support the idea that representing and comparing textual content at
different levels of granularity leads to improved relevance estimation.
          </p>
          <p>
            Inspired by this principle, our system implements chunk indexing with overlapping windows of 10
sentences, as follows:
1. Sentence-based Segmentation: Documents are segmented into chunks using the fr-sent.bin
French sentence model provided by OpenNLP[
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. Sentences are then grouped in chunks of 10
sentences each.
2. Sliding Window with Overlap: To preserve semantic continuity across chunks, a sliding
window approach is used. Each new chunk reuses the last few sentences of the previous chunk
(3 in this case) as overlap.
3. Unique Identification: Each chunk is indexed as an independent Lucene document with a
unique identifier formatted as &lt;docID&gt;_&lt;chunkIndex&gt;, ensuring traceability to its source
document.
          </p>
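          <p>A hedged sketch of this chunking step is given below; it uses OpenNLP's sentence detector with the fr-sent.bin model and a sliding window of 10 sentences with an overlap of 3. The class and method names are illustrative, not the exact project code.</p>
          <preformat>
// Illustrative chunker: sentence detection + overlapping windows of sentences.
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkerSketch {

    private static final int CHUNK_SIZE = 10; // sentences per chunk
    private static final int OVERLAP = 3;     // sentences shared with the previous chunk

    private final SentenceDetectorME detector;

    public ChunkerSketch(String sentenceModelPath) throws IOException {
        try (FileInputStream in = new FileInputStream(sentenceModelPath)) { // e.g. fr-sent.bin
            this.detector = new SentenceDetectorME(new SentenceModel(in));
        }
    }

    /** Returns pairs {chunkId, chunkText}, with chunkId formatted as "docId_chunkIndex". */
    public List&lt;String[]&gt; chunk(String docId, String contents) {
        String[] sentences = detector.sentDetect(contents);
        List&lt;String[]&gt; chunks = new ArrayList&lt;&gt;();
        int chunkIndex = 0;
        // advance by CHUNK_SIZE - OVERLAP so consecutive chunks share OVERLAP sentences
        for (int start = 0; start &lt; sentences.length; start += CHUNK_SIZE - OVERLAP) {
            int end = Math.min(start + CHUNK_SIZE, sentences.length);
            String text = String.join(" ", Arrays.copyOfRange(sentences, start, end));
            chunks.add(new String[] { docId + "_" + chunkIndex++, text });
            if (end == sentences.length) {
                break; // last (possibly shorter) chunk reached
            }
        }
        return chunks;
    }
}
          </preformat>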
          <p>During retrieval, the system operates at the chunk level, but the results are aggregated at the document
level using a simple yet effective post-processing step:
• For each original document, only the highest-scoring chunk is retained.
• The final score assigned to the document corresponds to the score of this best chunk, thus
reflecting its most relevant passage.</p>
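          <p>The document-level aggregation can be sketched as follows, assuming chunk identifiers of the form docId_chunkIndex; the method name is illustrative.</p>
          <preformat>
// Keep only the best-scoring chunk of each source document.
import java.util.LinkedHashMap;
import java.util.Map;

public final class ChunkAggregation {

    public static Map&lt;String, Float&gt; bestChunkPerDocument(Map&lt;String, Float&gt; chunkScores) {
        Map&lt;String, Float&gt; docScores = new LinkedHashMap&lt;&gt;();
        for (Map.Entry&lt;String, Float&gt; e : chunkScores.entrySet()) {
            String docId = e.getKey().substring(0, e.getKey().lastIndexOf('_'));
            docScores.merge(docId, e.getValue(), Float::max); // retain the highest chunk score
        }
        return docScores;
    }
}
          </preformat>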
          <p>This strategy enables both fine-grained relevance matching and efficient document-level ranking,
particularly beneficial in multilingual or domain-diverse collections. Further considerations on the
impact of chunking on retrieval performance are discussed in Section 5.8.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Searching</title>
        <p>Moving on with the "online" section of our system, the Searcher class is the component tasked with
interpreting user queries and scanning indexed documents to identify the most relevant matches. In
order to perform the searching process, we also developed a TopicParser class, which is responsible for
parsing the queries file and passing them with their ids to the searcher, in a similar way to what we have seen
when parsing the .json documents. Several components have been tried to further improve the search
phase, where the actual matching is performed through the BM25 model.
3.4.1. BM25
The next step involves retrieving relevant documents based on the queries submitted. This process
entails identifying the documents that match each query most closely, using scoring functions to
evaluate their relevance. Each document is assigned a score and the results are ranked accordingly, from
highest to lowest, under the assumption that the highest scoring documents are more relevant to the
query. Among these scoring methods, the BM25 ranking function, part of the "Best Match" (BM) family
of retrieval models, stands out. Despite its simplicity, BM25 remains highly effective and continues to
perform competitively against more modern retrieval techniques.
3.4.2. Queries
The queries provided by the LongEval team consist of a set of files, one for each monthly snapshot
provided. It is possible to work with two different formats, .trec files or .txt files. We chose to work
with .txt files for simplicity, but a parser was developed for each version in order to generate the
corresponding QualityQuery object required by Lucene.</p>
        <p>Query terms are then analyzed and used to generate more powerful queries, using different techniques
to improve the search result. In the following subsections we describe our implemented procedures.</p>
        <sec id="sec-3-4-1">
          <title>3.4.3. Fuzzy Search</title>
          <p>We experimented with a fuzzy search extension after observing a high incidence of typographical
errors in user queries, like "amurerie", which should be "armurerie". By replacing exact term matching
with a Levenshtein-based fuzzy query, we aimed to recover relevant documents despite misspellings.
However, contrary to our expectations, the fuzzy approach consistently underperformed compared to
strict matching: the relaxation of edit distance constraints introduced substantial noise and lowered
precision, ultimately degrading overall retrieval effectiveness. This surprising outcome led us to favor
the original exact-matching pipeline for the final system configuration.</p>
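          <p>For reference, the fuzzy variant we experimented with can be sketched as follows; it simply replaces each exact term clause with a Levenshtein-based FuzzyQuery (identifiers are illustrative).</p>
          <preformat>
// Sketch of the fuzzy query construction that was eventually discarded.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;

public final class FuzzyQueryBuilder {

    public static Query build(String field, String[] analyzedTerms) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String term : analyzedTerms) {
            // maximum edit distance of 2, so "amurerie" can still match "armurerie"
            builder.add(new FuzzyQuery(new Term(field, term), 2), BooleanClause.Occur.SHOULD);
        }
        return builder.build();
    }
}
          </preformat>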
        </sec>
        <sec id="sec-3-4-2">
          <title>3.4.4. Pseudo-Relevance Feedback</title>
          <p>Pseudo-Relevance Feedback (PRF) is an unsupervised technique used to improve retrieval effectiveness
by automatically reformulating the query. Unlike traditional relevance feedback, it does not rely
on human interaction. Instead, PRF expands the user’s query by leveraging terms from top-ranked
documents retrieved during an initial search. The method assumes that the most relevant terms can be
extracted from the top-ranked results and used to enhance recall.</p>
          <p>Our implementation follows these main steps:
1. Initial Retrieval: Execute the original query using BM25 to retrieve the top n documents.
2. Term Extraction: For each of the top documents, tokenize the text using the same analyzer as
during indexing. Filter out stopwords and tokens with inappropriate length. Then, compute term
frequency (TF) and document frequency (DF) per term.
3. Scoring and Ranking: Compute a BM25-inspired weight for each candidate expansion term t:
score(t) = TF(t) · log(1 + (N − DF(t) + 0.5) / (DF(t) + 0.5)),
where N is the total number of documents and DF(t) is the document frequency of term t.
4. Query Expansion: Select the top-k scoring terms and construct a new query combining:
• The original query, boosted by a weight α
• The expansion terms, boosted by a lower weight β
5. Final Retrieval: Submit the expanded query to obtain the final re-ranked list.</p>
          <p>Formally, the final query q′ is built as q′ = α · q + β · ∑_{t ∈ T_exp} t,
where T_exp is the set of top expansion terms.</p>
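          <p>A minimal sketch of the expansion step is shown below: assuming the top-k terms have already been selected with the score above, the original query and the expansion terms are combined with the two boosts (the parameter values used in our runs are reported in Section 5.7). Identifiers are illustrative.</p>
          <preformat>
// Sketch of the PRF query expansion with Lucene boosts.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

import java.util.List;

public final class PrfExpansion {

    /** score(t) = TF(t) * log(1 + (N - DF(t) + 0.5) / (DF(t) + 0.5)) */
    static double expansionScore(long tf, long df, long totalDocs) {
        return tf * Math.log(1.0 + (totalDocs - df + 0.5) / (df + 0.5));
    }

    public static Query expand(Query original, List&lt;String&gt; topTerms, String field,
                               float originalBoost, float expansionBoost) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        // original query, boosted by the weight alpha ("originalBoost")
        builder.add(new BoostQuery(original, originalBoost), BooleanClause.Occur.SHOULD);
        // expansion terms, boosted by the lower weight beta ("expansionBoost")
        for (String term : topTerms) {
            builder.add(new BoostQuery(new TermQuery(new Term(field, term)), expansionBoost),
                        BooleanClause.Occur.SHOULD);
        }
        return builder.build();
    }
}
          </preformat>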
        </sec>
        <sec id="sec-3-4-3">
          <title>3.4.5. Proximity Search</title>
          <p>Proximity search enriches traditional term-based retrieval by introducing a measure of how closely query
terms co-occur within a document, which often signals a stronger semantic relationship. Documents in
which the full set of query terms appears within this slop window receive a score boost, reflecting their
contextual cohesion. For example, the query "red house brick" could be used to retrieve documents
that contain phrases like "red house built with bricks", while avoiding documents
where the words are scattered far apart. However, as previously mentioned, the documents
must still contain all query terms to match the full proximity constraint. So to further capture cases
where only subsets of the query are present, we automatically generate every pair and triplet of query
terms and apply the same proximity criterion to each combination. We limit the subsets to pairs and
triplets to avoid exceeding Lucene’s BooleanQuery clause limit of 1024, which would otherwise be
reached quickly with longer queries. These proximity constraints are added as optional clauses in the
final query, ensuring that documents with locally clustered terms are boosted in rank. Later, we will
discuss the tuning of the slop parameter and how its value can influence system performance.</p>
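          <p>A hedged sketch of the proximity clauses is shown below: a full-query PhraseQuery plus unordered SpanNearQuery clauses over every pair of terms (triplets are built analogously). The slop value of 50 is the one selected after tuning; package names follow recent Lucene releases and other identifiers are illustrative.</p>
          <preformat>
// Sketch of the proximity constraints added as optional (SHOULD) clauses.
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.spans.SpanNearQuery;
import org.apache.lucene.queries.spans.SpanQuery;
import org.apache.lucene.queries.spans.SpanTermQuery;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;

public final class ProximityClauses {

    private static final int SLOP = 50; // tuned proximity window

    public static void addProximityClauses(BooleanQuery.Builder builder,
                                            String field, String[] terms) {
        // full-query proximity constraint
        builder.add(new PhraseQuery(SLOP, field, terms), BooleanClause.Occur.SHOULD);

        // unordered proximity constraints over every pair of query terms
        for (int i = 0; i &lt; terms.length; i++) {
            for (int j = i + 1; j &lt; terms.length; j++) {
                SpanQuery pair = new SpanNearQuery(
                        new SpanQuery[] {
                                new SpanTermQuery(new Term(field, terms[i])),
                                new SpanTermQuery(new Term(field, terms[j]))
                        },
                        SLOP,
                        false); // inOrder = false
                builder.add(pair, BooleanClause.Occur.SHOULD);
            }
        }
        // triplets are generated in the same way, keeping the overall number of
        // clauses below Lucene's BooleanQuery limit of 1024
    }
}
          </preformat>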
        </sec>
        <sec id="sec-3-4-4">
          <title>3.4.6. Synonyms</title>
          <p>
            In an attempt to improve performance, we implemented a query expansion technique with
synonyms, exploiting the semantic lexicon WOLF (WordNet Libre du Français) [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. This technique
allows words semantically similar to the original term to be included in the query, increasing the
probability of retrieving relevant documents even when they are expressed with different terms.
          </p>
          <p>During the Lucene query construction, each token is analyzed and, if the expansion with synonyms is
enabled, the WolfManager module is queried. This module manages a dictionary of synonyms extracted
from the XML WOLF resource. For each token, if synonyms are present, a SynonymQuery is created
that includes both the original term and its synonyms. This query is then added to the global Lucene
query using the BooleanQuery builder, with the SHOULD operator, to expand the coverage without
penalizing the results based on the original term.</p>
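          <p>This construction can be sketched as follows; WolfManager stands in for our synonym dictionary, and the lookup interface shown here is illustrative. The 0.5 weight for synonyms is the one used in our runs (see Section 5.5).</p>
          <preformat>
// Sketch of the synonym expansion added to the global BooleanQuery.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.SynonymQuery;
import org.apache.lucene.search.TermQuery;

import java.util.List;

public final class SynonymExpansion {

    public static void addTokenClause(BooleanQuery.Builder builder, String field,
                                      String token, List&lt;String&gt; synonyms) {
        if (synonyms.isEmpty()) {
            // no synonyms in WOLF: fall back to a plain term clause
            builder.add(new TermQuery(new Term(field, token)), BooleanClause.Occur.SHOULD);
            return;
        }
        SynonymQuery.Builder synBuilder = new SynonymQuery.Builder(field);
        synBuilder.addTerm(new Term(field, token));         // original term, full weight
        for (String syn : synonyms) {
            synBuilder.addTerm(new Term(field, syn), 0.5f);  // synonyms, reduced weight
        }
        builder.add(synBuilder.build(), BooleanClause.Occur.SHOULD);
    }
}
          </preformat>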
        </sec>
        <sec id="sec-3-4-5">
          <title>3.4.7. Reranking</title>
          <p>
            To improve the quality of the results, the first k retrieved documents are subjected to a semantic
re-ranking phase. The value of k will be discussed in Section 5.9. For each document, the
textual content and the query are sent to a local server, which hosts a SentenceTransformer model:
all-roberta-large-v1 [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], available on Hugging Face [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. The model is built upon the
RoBERTa-large architecture, a robustly optimized variant of BERT, and is fine-tuned on over one billion sentence
pairs to produce semantically meaningful embeddings. Given an input, the model maps it into a dense
1024-dimensional vector space such that semantically similar sentences are located close together. This
makes it well-suited for tasks like reranking, semantic search, and clustering.
          </p>
          <p>The server performs these steps:
• It uses the model to compute a vector representation (embedding) of both the query and
the document and evaluates their similarity through the scalar product.
• The score obtained, between 0 and 1, reflects the semantic relevance between the two texts; it is
weighted and added to the original Lucene score, in order to obtain a new ranking value.
• The documents are then reordered based on this aggregated score.</p>
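          <p>A minimal sketch of this score fusion is given below; the client towards the embedding server and the fusion weight are illustrative placeholders, not the exact project code.</p>
          <preformat>
// Sketch of the reranking step: fuse Lucene scores with semantic similarities.
import java.util.List;

public final class RerankerSketch {

    /** Client towards the local SentenceTransformer server (all-roberta-large-v1). */
    public interface EmbeddingClient {
        double similarity(String query, String documentText); // score in [0, 1]
    }

    public static final class ScoredDoc {
        public final String docId;
        public final String contents;
        public double score; // initially the Lucene score, then the fused score

        public ScoredDoc(String docId, String contents, double score) {
            this.docId = docId;
            this.contents = contents;
            this.score = score;
        }
    }

    public static void rerank(String query, List&lt;ScoredDoc&gt; topK,
                              EmbeddingClient client, double weight) {
        for (ScoredDoc d : topK) {
            // weighted semantic similarity added to the original Lucene score
            d.score += weight * client.similarity(query, d.contents);
        }
        topK.sort((a, b) -&gt; Double.compare(b.score, a.score)); // descending by fused score
    }
}
          </preformat>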
          <p>This approach allows us to combine the efficiency of traditional retrieval with the generalization and
semantic understanding capabilities offered by Deep Learning models.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Running the system</title>
        <p>Once our components have been developed, we implemented a main SystemRunner class to coordinate
the entire pipeline execution. The indexing and searching parts can be executed at different times, to
simulate the offline and online deployment phases. The parameters of the system are passed to the
components through constructors and XML configuration files, in order to improve the flexibility of the
system.</p>
        <p>The final complete overview of the system is shown in Figure 1.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>As already discussed, the system has been developed in Java through the Apache Lucene library,
but a few side components have been developed in Python. The source code can be found at
https://bitbucket.org/upd-dei-stud-prj/seupd2425-3ds2a/src/master/, and for evaluation we used
the metrics provided by the trec_eval 9.0.7 tool, available at https://trec.nist.gov/trec_eval/.</p>
      <p>The system has been run on the collection provided by the LongEval team at
https://clef-longeval.github.io/data/. We used our personal computers to test our system, where the most expensive runs
have been completed using the hardware described in Table 1.</p>
      <sec id="sec-4-1">
        <title>4.1. Training data and Evaluation</title>
        <p>The training collection is made up of several components. The LongEval team provided a timeline
sequence of documents, extracted from the Qwant search engine over a sequence of months, from
June 2022 to February 2023, with every month in a specific subfolder. For every monthly snapshot,
the corresponding list of queries to submit has been provided at
https://github.com/clef-longeval/clef-longeval.github.io/tree/master/collection. Moreover, to understand how the system performed each
month, a corresponding qrels file has been provided for each month.</p>
        <p>The evaluation metrics have been computed using the -m all_trec parameter of the trec_eval executable,
which will give us several metrics to examine. For training data, we mostly focused on the nDCG and
MAP metrics, in order to improve the overall result of the system.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>The first thing we need to do to improve our system is to set a reference score for our metrics. In order
to do that, we configured our system to use only the main components of Lucene. Upon establishing a
baseline evaluation score, we could then try to develop new strategies to improve the final result of the
system.</p>
      <sec id="sec-5-1">
        <title>5.1. Baseline performance using Lucene library</title>
        <p>To set a baseline for our next runs, we used a simple configuration of the Indexer, Analyzer, and Searcher
components, as described in Table 2.</p>
        <p>When running this base system on the provided training dataset, we achieved very different results
depending on the considered month. We repeated the measurements several times and came to the
conclusion that some months have very difficult topics or unusual judgments, but probably also because
some documents are in different languages and may not be processed in an effective way. Nevertheless,
Table 3 provides a complete overview of the baseline system’s performance.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Stemming against Lemmatizing</title>
        <p>
          Even though we could not achieve an efficient system using OpenNLP for lemmatizing the
documents, we tried a simpler way to generate a lemmatized version of the collection. Using a
dictionary-based lemmatizer [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], we generated a lemmatized copy of the collection, including the queries, and fed
them to our system, taking care to disable any active stemmer. The results are depicted in Figure 2
and Figure 3. The comparison with the baseline system highlights that the performance is largely
aligned with the stemmed version of the collection, but it is important to note that the lemmatizer used
is probably too simple. Modern systems could take greater advantage of LLMs, and a
better lemmatization would make it possible to further expand each term and improve the final result of the
system.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Choice of stemmer</title>
        <p>To assess the effectiveness of different text analysis strategies, we performed a series of experiments
combining various tokenizers (Standard, Letter, and Whitespace) with multiple stemmers (FrenchLight,
FrenchMinimal, Snowball, and None). All configurations were tested on the 2022-06 dataset.</p>
        <p>The evaluation focused on two main retrieval metrics: Mean Average Precision (MAP) and nDCG.
Across the board, the FrenchLight stemmer consistently delivered the best results, particularly when
paired with the StandardTokenizer, achieving the highest MAP (0.1174) and competitive nDCG scores.
Configurations using the LetterTokenizer also performed well, providing strong baseline alternatives.</p>
        <p>Interestingly, the use of stopword filtering did not lead to performance improvements. This trend
was consistent across both French and English stoplists, suggesting that aggressive stopword removal
provides little benefit in this context.</p>
        <p>Although stopword filtering did not improve retrieval performance in our current
experiments, this feature will remain part of the default analysis pipeline. There are several reasons behind
this decision. First, stopword removal improves robustness by eliminating high-frequency, low-content
terms that might introduce noise. Second, stopwords can significantly affect index size and efficiency,
especially when dealing with large corpora. Lastly, including stopword filtering maintains compatibility
with standard IR practices and facilitates future integration of more advanced techniques, such as query
expansion or relevance feedback, which often benefit from a cleaner input signal. Some interesting results
can be seen in Table 4.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. BM25 Parameters Tuning</title>
        <p>To evaluate the impact of the parameters of the ranking function on retrieval performance, we
experimented with different configurations of the BM25 similarity function, which is the default scoring
method used in many modern IR systems.</p>
        <p>We evaluated multiple parameter settings of BM25, particularly focusing on variations of the
parameters k1 (term frequency saturation) and b (document length normalization). The baseline run
used the default Lucene settings, while alternative runs explored more aggressive and conservative
configurations.</p>
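        <p>A minimal sketch of how these configurations were set is shown below; only the similarity of the searcher is changed, using the (k1, b) pairs compared in this section.</p>
        <preformat>
// Sketch of configuring the BM25 parameters on the searcher.
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;

public final class Bm25Config {

    public static IndexSearcher searcherWith(IndexReader reader, float k1, float b) {
        IndexSearcher searcher = new IndexSearcher(reader);
        searcher.setSimilarity(new BM25Similarity(k1, b)); // e.g. (1.2f, 0.75f), (2.0f, 1.0f), (0.9f, 0.4f)
        return searcher;
    }
}
        </preformat>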
        <p>Among the settings tested, the default setting (with parameters k1 = 1.2 and b = 0.75) achieved the best
overall performance. This setup outperformed alternative configurations, including more aggressive
ones such as BM25(2.0, 1.0), which, although slightly better in recall (R@1000 = 0.4113), showed less
consistent results in early precision and overall MAP. In contrast, the more conservative configuration
BM25(0.9, 0.4) significantly underperformed in both early and deep precision. An additional run with
BM25(0.5, 0.0), which removed length normalization entirely, was tested but, as expected, showed the
worst scores across all metrics.</p>
        <p>These results suggest that the default BM25 configuration offers the best trade-off between precision and
robustness for our dataset, and thus was selected as the standard configuration in our system.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Synonyms</title>
        <p>In this section, we discuss the results achieved through synonym query expansion. The
experimental setup was the same as the baseline. The synonyms included in the queries are analyzed
with the same analyzer used for indexing the documents and, furthermore, are assigned a weight of 0.5
to reduce their significance compared to the words originally present in the queries.</p>
        <p>As shown in Table 6, query expansion with synonyms did not lead to any performance improvements.
In all months analyzed, the MAP and nDCG scores are slightly lower than those of the baseline. This
suggests that including synonyms, even with a reduced weight, tends to introduce noise rather than
add value to the queries. Based on these results, we decided not to adopt this strategy in future
configurations.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Proximity Search</title>
        <p>Here we present the results obtained through the implementation of proximity search. As anticipated
before, the tuning of the slop parameter was a key point for obtaining the best results. Several test
runs were conducted to determine the optimal slop setting, and the results showed that a value of 50
consistently yielded the best performance. Therefore, this value was selected for the final configuration.
A possible explanation for this result is that smaller slop values may be too restrictive, missing relevant
documents where query terms appear slightly farther apart. On the other hand, larger values might
introduce too much noise, retrieving documents with weaker term associations. A slop of 50 appears to
strike the right balance between flexibility and precision, allowing for meaningful proximity without
overly relaxing the positional constraints. As discussed in Section 3.4.5, we explored two
approaches: the first uses a PhraseQuery to match documents where all query terms appear in the same
order as in the original query; the second extends this by incorporating additional proximity constraints
based on all possible pairs and triplets of query terms, regardless of their order, using SpanNearQuery
with inOrder=false. As previously discussed, in the unordered case, we restricted the proximity
constraints to all possible pairs and triplets of query terms, rather than the full query, in order to avoid
exceeding the BooleanQuery clause limit.</p>
        <p>Table 7 compares the system performance when using only the full-query proximity constraints
(PhraseQuery) against the extended approach that also includes unordered pairs and triplets (PhraseQuery
+ SpanNearQuery). Firstly, it can be observed that the implementation of proximity search leads to a
consistent and substantial improvement in both MAP and nDCG scores compared to the baseline. In
particular, the maximum change is observed in January with an increase of 7.2% for the MAP and 4.15%
for the nDCG. Regarding the comparative performance of the two strategies, a trade-off emerges: relative
to the PhraseQuery-only approach, the extended method involving unordered pairs and triplets leads
to a slight decrease in MAP, while consistently improving nDCG (except for January). This suggests
that incorporating unordered term combinations may retrieve more relevant documents overall, even if
precision at the top ranks is marginally affected.</p>
      </sec>
      <sec id="sec-5-7">
        <title>5.7. Pseudo Relevance Feedback</title>
        <p>We tested two different configurations of Pseudo Relevance Feedback (PRF), with the aim of evaluating
the impact of different parameter settings. The first configuration, referred to as PRF1, is more
conservative and uses the following settings: "topDocsForPRF": 10, "topTermsToAdd": 3,
"originalBoost": 1.0, and "expansionBoost": 0.3. The second configuration, PRF2, is more
aggressive and relies on "topDocsForPRF": 20, "topTermsToAdd": 5, "originalBoost":
0.8, and "expansionBoost": 0.5.</p>
        <p>It is worth noting that PRF is typically more effective in scenarios where initial precision is high
(e.g., high P@5 or P@10), as it assumes that top-ranked documents are relevant and can guide useful
expansion. Therefore, evaluating PRF in isolation on a general baseline setup may not fully reflect its
potential. Nevertheless, this comparison provides useful insights for tuning the PRF strategy.</p>
        <p>As expected, both PRF configurations resulted in a decrease in nDCG when applied on top of the
baseline system. However, the comparison between PRF1 and PRF2 helps highlight the relative benefits
of a more conservative versus a more aggressive expansion approach.</p>
        <p>Later in this report, we will discuss how PRF interacts with other retrieval enhancements and whether
its impact changes when combined with more effective base configurations.</p>
      </sec>
      <sec id="sec-5-8">
        <title>5.8. Chunk Indexing</title>
        <p>To investigate whether finer-grained indexing could enhance retrieval effectiveness, we experimented
with a chunk indexing approach. The documents were split into overlapping chunks of 10 sentences
each, with a shared window of 3 sentences.</p>
        <p>As shown in Figure 5 and Figure 6, chunk indexing consistently slightly underperforms compared
to the baseline. Both MAP and nDCG values are lower throughout the time period, suggesting that
splitting documents into sentence-based overlapping chunks, in this configuration, degrades retrieval
performance.</p>
        <p>In this scenario, chunking probably introduces noise and dilutes term co-occurrence signals, making
it harder for the system to correctly prioritize relevant documents.</p>
      </sec>
      <sec id="sec-5-9">
        <title>5.9. Reranker</title>
        <p>
          As discussed in Section 3.4.7, we need to determine a threshold for the number of documents to
be reranked. Based on the work by Basaglia et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], we observed that reranking the top 150 documents
provided a good balance between efficiency and effectiveness, so we selected this value. Below is a table
showing the performance achieved by our system each month, using reranking and proximity search
(the PhraseQuery + SpanNearQuery approach), which are the techniques that improve our system.
        </p>
        <p>As shown, reranking leads to improvements in both MAP and nDCG scores over proximity search
(PhraseQuery + SpanNearQuery approach) results across all months. The greatest gains are observed in
November, with MAP increasing by 5.1% and nDCG by 2.7%.</p>
      </sec>
      <sec id="sec-5-10">
        <title>5.10. Combined Strategies</title>
        <p>In this section, we evaluate the performance of various combined approaches, all built on top of the
Proximity Search method. Given the consistently strong results achieved by PS alone, it serves as the
foundation for all hybrid strategies evaluated here.</p>
        <p>We consider combinations such as PS with Chunk Indexing (PS_CI), PS with Pseudo Relevance
Feedback (PS_PRF1), and the full combination PS_CI_PRF1. The goal was to assess whether the
integration of multiple retrieval enhancements could provide cumulative improvements.</p>
        <p>As the figures above show, none of the combined strategies significantly outperforms the standalone
PS method. In some cases, such as PS_CI_PRF1, the addition of other techniques even slightly degrades
performance. This is particularly evident in the MAP plots, where more complex configurations tend to
underperform compared to PS alone.</p>
        <p>One possible explanation is that Pseudo Relevance Feedback (PRF1) may not yet be effective in
this context, possibly due to insufficient initial precision. PS alone may not reach high enough early
precision to make expansion terms reliably informative. Alternatively, the PRF configuration may
require different fine-tuning when applied on top of PS.</p>
        <p>Despite this, the results demonstrate the robustness of the PS method. Even when combined with less
effective strategies, PS manages to retain a high level of performance across time. This suggests that PS,
as implemented, remains the most reliable and effective enhancement among those tested in this study.</p>
      </sec>
      <sec id="sec-5-11">
        <title>5.11. Training results</title>
        <p>In this section, we compare the best-performing configurations across various implemented techniques.
These configurations are the runs we have selected for submission to the LongEval Conference:
• The best system that doesn’t use reranking.
• The best overall system.
• A system using only chunk indexing.
• A system using pseudo-relevance feedback and proximity search.</p>
        <p>• A system using proximity search, chunk indexing and pseudo-relevance feedback.</p>
        <p>Each of these systems uses the baseline configuration, which we recall is the one reported
in Table 2. The proximity search approach is the one that also considers pairs and triplets
(PhraseQuery + SpanNearQuery).</p>
        <p>Table 9 presents, for each system configuration, the best result achieved across all months, which
corresponds to the score of January.</p>
        <p>From the results shown in Table 9, we can draw several key insights. First, proximity search alone
(System 3) already provides a significant improvement over the baseline, even without reranking.
However, the best overall performance is achieved by combining proximity search with reranking
(System 5), which outperforms all other configurations in both MAP and nDCG, highlighting the
importance of reranking in enhancing retrieval quality. Interestingly, the combination of proximity
search, chunk indexing, and pseudo-relevance feedback (System 4) does not surpass the performance of
simpler approaches, suggesting that the integration of multiple techniques does not necessarily lead
to additive gains. Chunk indexing alone (System 2) shows a loss in performance, especially in terms
of nDCG. These findings support the submission of System 3 as the best non-reranking system and
System 5 as the most effective overall configuration.</p>
        <p>The interpolated Precision-Recall curves shown in Figure 11 confirm the results observed in the
tabular metrics. System 5 consistently demonstrates the highest precision across nearly all levels of
recall, reinforcing its position as the overall best-performing configuration, closely followed by System
3. In contrast, System 4 clearly underperforms relative to the other systems, suggesting that combining
multiple techniques—as done in this configuration—may introduce additional complexity without
yielding significant performance gains. Notably, Systems 3 and 5 exhibit remarkably similar trends,
with System 3 performing slightly worse; this may be attributed to their shared reliance on the proximity
search technique. Another noteworthy observation is that, within the recall interval approximately
between 0.55 and 0.75, Systems 1, 2, and 4 outperform the proximity search-based systems.</p>
      </sec>
      <sec id="sec-5-12">
        <title>5.12. Training results on updated training dataset</title>
        <p>We conducted experiments also on the updated version of the training queries. Since all previously
evaluated systems yielded the same qualitative conclusions despite improvements in their metric values,
we have omitted their full result tables to avoid verbosity and redundancy. Instead, we report only the
proximity search technique, which yielded qualitatively different results compared to the first version of the
training set.</p>
        <p>Table 10 reveals a notable shift from our initial findings (Table 7): the variant with only PhraseQuery
now outperforms the combined PhraseQuery + SpanNearQuery approach. Consequently, we will
hereafter employ proximity search exclusively with PhraseQuery. Furthermore, by comparing results
obtained using the updated training set (Table 10) with the first training set’s results (Table 7), we can
also notice a general boost in performance across all months. The performance trend of our system is
consistent with the original training set, with the best performance again obtained
in January.</p>
      </sec>
      <sec id="sec-5-13">
        <title>5.13. Test results</title>
        <p>CLEF also released a test set consisting of documents spanning from March 2023 to August 2023, again
organized on a monthly basis, each accompanied by its corresponding set of qrels. To evaluate the
effectiveness of the submitted systems, we computed the nDCG and MAP metrics on the test collections.
In order to avoid overly verbose and redundant reporting, we selected a representative subset of months,
each capturing a different stage in the temporal evolution of the test set. This strategy allows for
a clearer analysis of how each IR system performs and adapts over time—an essential aspect of the
LongEval task, which aims to identify systems capable of maintaining stable performance in the face of
temporal drift.</p>
        <p>The analysis was carried out for each of the five submitted systems across three temporal segments:
the short-term test collection (March 2023), the mid-term collection (June 2023), and the long-term
collection (August 2023). Table 11 presents the performance of the systems in March, June, and
August.</p>
        <p>The table also reports the performance of the presented baseline approach on the test dataset. As we
can see, not every developed system improves on the baseline scores: Systems 1, 2, and 4 show a drop
of 2%-5% in the nDCG score, and up to 10% in the MAP score. On the other hand, the best-performing
systems (Systems 3 and 5) show relevant improvements: System 3 increases the nDCG metric
by 3% to 6% and MAP by 6% to 9%, while System 5 increases
nDCG by 5% to 8% and MAP by 8% to 9%, demonstrating the
effectiveness of reranking and query manipulation techniques with respect to a
more traditional system. Moreover, the analysis of nDCG and MAP scores reveals a clear pattern: all five
systems improve from March to a peak in June 2023, but then undergo a steady performance decline by
August 2023. Between March and August, nDCG drops by approximately 17% – 19% and MAP by 14% –
18% across systems, indicating susceptibility to temporal drift. Notably, System 5—despite maintaining
the largest lead—exhibits the greatest sensitivity to drift (-18.7% nDCG), whereas the worst system,
System 1, shows the smallest decline (-16.9% nDCG). Another interesting fact is that the spread between
the best and worst systems shrinks from 5.8 points in March to 3.95 points in August, indicating a sort
of convergence of system efficacy under temporal drift.</p>
        <p>From March to June, all systems register a clear uplift, but the magnitude varies. Systems 1 and 2 lead
the pack with nDCG increases of approximately 7.0% each (MAP gains of 11.4% and 10.2%, respectively),
whereas the more complex Systems 3 and 5 improve by only 5.5% and 5.3% in nDCG (MAP gains of
7.3% and 7.4%). This suggests that simpler indexing strategies may adapt more quickly to fresh data.</p>
        <p>Overall, these findings suggest that although reranking and proximity search enhancements yield the
strongest results, they remain vulnerable to evolving document distributions, underscoring the need for
IR models that are more robust to temporal change.</p>
      <p>In addition, statistical hypothesis tests were performed to assess whether the performance differences
among the systems are statistically significant. This analysis is crucial to determine whether certain
configurations genuinely contribute to performance improvements or if the observed gains are merely
the result of test variance. In the following section, we provide a detailed explanation of the SHT
methodology and present the results for both the evaluation metrics previously discussed.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Statistical Hypothesis Testing</title>
      <p>Statistical Hypothesis Testing (SHT) provides a mathematical framework to draw statistical inferences
from data. It involves comparing a null hypothesis H0 against an alternative hypothesis H1. The result
is considered statistically significant if the observed data are unlikely to have occurred under the null
hypothesis, based on a predefined threshold α, known as the significance level. If this condition is met,
the null hypothesis is rejected; otherwise, we fail to reject it.</p>
      <p>In this study, we apply Two-Way ANalysis Of VAriance (ANOVA2) as a statistical test: it examines the
influence of two different variables, which in our case are the systems and the topics used. ANOVA2 is
used to evaluate the difference between the means of more than two groups, which fits our need to
compare five different IR systems. In ANOVA2, the hypotheses are as follows:
• H0 - the means of all groups are equal;
• H1 - at least two groups have different means.</p>
      <p>Whenever the ANOVA test reveals statistically significant differences (i.e., H0 is rejected), we further
conduct a Tukey’s Honestly Significant Difference (HSD) test as a post-hoc analysis. This allows pairwise
comparisons between group means to identify which differences are statistically meaningful. For all
tests we adopt a significance level α = 0.05.</p>
      <p>In this section we report the analysis conducted on the test collection.</p>
      <sec id="sec-6-1">
        <title>6.1. Short-term</title>
        <p>To visually complement the numerical findings and support the subsequent statistical testing, in
Figure 12 we report the box plots of the Average Precision (AP) and nDCG scores for each system on the
short-term collection. These graphical representations provide insights into the distribution, variability,
and central tendencies of system performance.</p>
        <p>As observed in the AP box plot, all systems exhibit very similar distributions, with nearly identical
interquartile ranges and whiskers, suggesting comparable variability. In all systems, the mean (green
dotted line) is higher than the median (orange line), indicating slight right-skewness—where a few
high-performing queries raise the average. All systems reach the maximum AP value of 1.0 and a
minimum close to 0.0, revealing the presence of both very successful and very weak queries. The
absence of outliers further suggests consistent performance across queries.</p>
        <p>Similarly, the nDCG box plot confirms these trends: the five systems show nearly overlapping distributions
with consistent whisker lengths and interquartile ranges. The means exceed the medians for all systems,
indicating a right-skewed distribution here as well.</p>
        <p>These findings highlight that, despite minor differences, the systems behave similarly on the
short-term collection, reinforcing the need for formal hypothesis testing to establish whether observed
differences are statistically significant.</p>
        <p>Table 12 and Table 13 report the results of the ANOVA2 test conducted on the short-term collection
for MAP and nDCG, respectively.</p>
        <p>The sum of squares (SS) column quantifies the total variation attributed to each source—either
between systems (Columns), between topics (Rows), or residual error. The degrees of freedom (df)
indicates how many independent components contribute to each source’s variability.</p>
        <p>The mean square (MS) is obtained by dividing SS by the corresponding degrees of freedom and
reflects the variance contribution of each source. The F column reports the F-statistic, which tests
whether the observed variability across systems or topics is significantly greater than what would be
expected by chance. Finally, the Prob&gt;F column shows the p-value associated with each F-statistic.</p>
        <p>In both tests, the p-value for the Columns source is well below the 0.05 significance threshold,
providing overwhelming evidence that the five systems do not perform equally. The Rows component,
which reflects topic variability, is also highly significant, indicating that the choice of topic greatly
impacts system performance. This suggests that performance varies not only across systems but also
across queries, underscoring the importance of robust performance across diverse topics. Thus, we
reject the null hypothesis H0.</p>
        <p>After discovering that the five systems are different, it is useful to compare systems pairwise in order
to understand where the difference comes from. For this purpose, as said before, we employ
Tukey’s Honestly Significant Difference test, which complements the statistical results
by illustrating the pairwise comparisons between systems, highlighting which differences in mean
performance are statistically significant. Figure 13 shows a comparison among the means of the different
groups. System 1 has been selected as the reference group, as indicated by the vertical dotted line
corresponding to its mean value. The plots show the mean performance of each system along with
their confidence intervals, both for AP and nDCG. We can see that four out of the five systems exhibit
statistically significant differences in their mean values when compared to System 1, both for AP and
nDCG. Specifically, the confidence intervals for Systems 2, 3, 4 and 5 do not overlap with that of System
1, confirming that the differences are not due to chance at the selected significance level α = 0.05. The
output of the test is reported in Table 14 and Table 15 for AP and nDCG, respectively. The p-values
lower than 0.05 are shown in bold, meaning that we reject the null hypothesis H0.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Mid-term</title>
        <p>This section considers the mid-term collection. Figure 14 reports the box plots of AP and nDCG for
each system on the mid-term collection.</p>
        <p>The considerations for the mid-term collection are largely in line with those made for the short-term.
As in the previous case, all systems exhibit similar distribution shapes, with comparable interquartile
ranges and whisker lengths, and full coverage of the score range from 0.0 to 1.0. In both AP and nDCG
plots, the mean values consistently exceed the medians, indicating a right-skewed distribution due to a
subset of high-performing queries.</p>
        <p>A notable difference, however, is that System 3 and System 5 exhibit slightly higher central values,
particularly in the nDCG plot. This hints at a marginally stronger performance for these two systems
on the mid-term collection, even though the overall variability remains similar across all systems.</p>
        <p>As done for the short-term collection, we report the tables with the results of the ANOVA2 tests
conducted on AP and nDCG. In this case as well, the p-values associated with both the systems and the
queries fall below the significance threshold α. Consequently, we reject the null hypothesis and proceed
with the Tukey’s HSD test, reporting the plots obtained.</p>
        <p>In the mid-term scenario (Figure 15), we observe a pattern of differences among the systems that
closely mirrors what emerged in the short-term evaluation. Here, System 1 has again been chosen as the
reference group. As before, each system’s mean performance is plotted with its confidence
interval, allowing us to judge at a glance which differences are genuine and which might arise by
chance. Just as in the short-term case, four out of the five systems differ significantly from the reference
at α = 0.05: none of the confidence intervals for Systems 2, 3, 4 or 5 overlap with that of System 1, so
we can reject the null hypothesis of equal means for all those pairwise comparisons.</p>
        <p>As before, we report the output of the test in Table 18 and Table 19 for AP and nDCG, respectively.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Long-term</title>
        <p>Lastly, we discuss the long-term collection. In Figure 16 a marked difference is observed in the medians,
with System 1 and System 4 clearly underperforming relative to the others. Their boxes are more
compressed towards the bottom of the scale, indicating consistently low performance across topics.
Moreover, in Figure 16a System 1 exhibits a greater number of outliers, suggesting instability or
sporadic good performance on a few topics, but generally poor results overall. System 2 also shows
worse performance compared to the other systems, though not as poor as System 1 and System 4.</p>
        <p>Figure 16: (a) AP values; (b) nDCG values.</p>
        <p>With respect to the ANOVA2 results in Table 20 and Table 21, we can draw conclusions analogous to those for
the short-term and mid-term collections: both factors exhibit extremely low p-values, indicating that
differences among retrieval systems and the inherent variability across queries are highly significant.
Consequently, we apply Tukey’s Honestly Significant Difference post-hoc test.</p>
        <p>For the long-term collection, the results of Tukey’s HSD test (Figure 17) reveal patterns consistent
with those observed in the short-term and mid-term evaluations. System 1 is again used as the reference
group. In the AP comparison (Figure 17a), the confidence intervals for Systems 2, 3, 4, and 5 do not
overlap with that of System 1, indicating statistically significant differences in mean performance. This
confirms that, as in the previous evaluations, System 1 performs significantly differently from all other
systems at the α = 0.05 significance level. In contrast, for the nDCG metric (Figure 17b), only Systems
2, 3, and 5 exhibit confidence intervals that do not intersect with that of System 1, suggesting significant
differences in these cases. System 4, on the other hand, does not show a statistically significant difference
from System 1 in terms of nDCG, as also confirmed by the corresponding p-value reported in Table 23,
which exceeds the significance threshold of α = 0.05.</p>
        <p>Figure 17: (a) AP values; (b) nDCG values.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>This section summarizes the different techniques that we explored in order to improve our Information
Retrieval system. Starting from a basic configuration built with Lucene’s API, we measured its
performance to set a reference score, and we observed how different data can greatly change the
performance of our system. We explored several possible configurations for our analyzer, including
a lemmatization step applied to the entire collection, and we found that a light stemming process
produces a higher score than more complex stemmers. We then moved on to testing different
approaches to improve the matching and the ranking of the documents: we first introduced a different
way to index the documents, splitting them into smaller pieces and scoring each piece individually,
but we soon discovered that this approach leads our system to underperform with respect to the fixed
baseline.</p>
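      <p>As an illustration of the chunk-based indexing idea just described, the following minimal sketch (in Python, with function names of our own choosing; the actual system was built on Lucene) splits a document into overlapping word windows, scores each window independently against the query, and lets the document inherit its best window score.</p>
      <preformat>
# Hedged sketch of chunk-based scoring: split, score each piece, keep the maximum.
from typing import Callable, List

def chunk(text: str, size: int = 100, overlap: int = 20) -> List[str]:
    """Split a document into overlapping windows of `size` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def score_document(query: str, document: str,
                   score_fn: Callable[[str, str], float]) -> float:
    """Score every chunk with score_fn (e.g. a BM25 scorer) and keep the best one."""
    return max(score_fn(query, piece) for piece in chunk(document))
      </preformat>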
      <p>We later moved from expanding the query terms with synonyms, which did not lead to an
improvement of the system, to relaxing the constraints on term positions in the proximity search approach,
which instead proved to increase the effectiveness of the system, especially when fine-tuning the slop
parameter. We also explored the possibility of using the top-most retrieved documents to improve the
query itself through the pseudo-relevance feedback technique, which is especially useful when precision is high
at the top-most ranking positions. Finally, we used sentence embeddings to rerank the results returned
by our system through the Roberta-Large model, which again led to an improvement of our
system’s performance.</p>
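      <p>For concreteness, the reranking step can be sketched as follows. This is a minimal illustration using the sentence-transformers library and the all-roberta-large-v1 checkpoint cited in the references, not the exact code of our system; the rerank function name is our own and the surrounding retrieval pipeline is assumed.</p>
      <preformat>
# Hedged sketch of semantic reranking with sentence embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-roberta-large-v1")

def rerank(query: str, docs: list) -> list:
    """Re-order the retrieved documents by cosine similarity to the query embedding."""
    q_emb = model.encode(query, convert_to_tensor=True)
    d_emb = model.encode(docs, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, d_emb)[0]          # one similarity per document
    return sorted(zip(docs, sims.tolist()), key=lambda p: p[1], reverse=True)
      </preformat>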
      <p>When these techniques are deployed individually, it may appear difficult to tell whether the
system is improving; however, combining them all turns out not to be the most effective way to increase
the performance of the system, which is also very sensitive to the data it is fed as input. In order to
continue improving the effectiveness of an IR system, more research is needed. With regard to synonyms
and query expansion, the most intuitive way to improve the query is to use a language model to
learn the synonyms directly from the collection, instead of relying on external datasets. As regards
the lemmatization process, it may be useful to construct a dictionary of topics and use the same
dictionary to tag each provided query with a more general word, using those tags to enlarge the pool of
matched documents. Another possible improvement of the system could be made by joining together
similar queries, so that the user may find more documents related to the possibly imprecise query they
submitted. Once again, the rapid development of LLMs that we are witnessing nowadays can be a great tool
to improve the effectiveness of these IR systems, especially when the data collections are mostly made
up of textual documents.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT in order to: check grammar and spelling,
and paraphrase and reword text. After using this tool/service, the authors reviewed and edited the content as needed
and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cancellieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Ebshihy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iommi</surname>
          </string-name>
          , J. Keller, P. Knoth,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pride</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schaer</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2025 LongEval Lab on Longitudinal Evaluation of Model Performance</article-title>
          , in: J.
          <string-name>
            <surname>Carrillo-de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Sushilkumar</given-names>
            <surname>Chavhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raghuwanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dharmik</surname>
          </string-name>
          ,
          <article-title>Information Retrieval using Machine Learning for Ranking: A Review</article-title>
          , https://iopscience.iop.org/article/10.1088/1742-6596/1913/1/012150/meta,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Multi-stage document ranking with BERT</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1910.14424. arXiv:1910.14424.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Basaglia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stocco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <article-title>Seupd@clef:team dam on reranking using sentence embedders</article-title>
          ,
          <year>2024</year>
          . URL: https://ceur-ws.org/Vol-3740/paper-216.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pramana</surname>
          </string-name>
          , Debora,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Subroto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. S.</given-names>
            <surname>Gunawan</surname>
          </string-name>
          , Anderies,
          <article-title>Systematic literature review of stemming and lemmatization performance for sentence similarity</article-title>
          ,
          <source>in: 2022 IEEE 7th International Conference on Information Technology and Digital Applications (ICITDA)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:10.1109/icitda55840.2022.9971451.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Balakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lloyd-Yemoh</surname>
          </string-name>
          ,
          <article-title>Stemming and lemmatization: A comparison of retrieval performances</article-title>
          ,
          <year>2014</year>
          . URL: https:// api.semanticscholar.org/ CorpusID:52998253.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Abedini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Akysh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fahoud</surname>
          </string-name>
          , Seupd@clef:
          <article-title>Team kalu on improving search engine performance with query expansion and re-ranking approach</article-title>
          ,
          <year>2024</year>
          . URL: https://ceur-ws.org/Vol-3740/paper-214.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I. A. H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Moncada-Ramírez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Santini</surname>
          </string-name>
          , G. Zago,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          , Seupd@clef:
          <article-title>Team jihuming on enhancing search engine performance with character n-grams, query expansion, and named entity recognition</article-title>
          ,
          <year>2023</year>
          . URL: https://ceur-ws.org/Vol-3497/paper-185.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>The</given-names>
            <surname>Apache Software Foundation</surname>
          </string-name>
          , Apache Lucene, https:// lucene.apache.org/ ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>The</given-names>
            <surname>Apache Software Foundation</surname>
          </string-name>
          , Apache OpenNLP, https:// opennlp.apache.org/ ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          ,
          <article-title>The Lefff, a freely available and large-coverage morphological and syntactic lexicon for French</article-title>
          , in:
          <source>7th International Conference on Language Resources and Evaluation (LREC 2010)</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Coulombe</surname>
          </string-name>
          ,
          <article-title>A French Lemmatizer in Python based on the LEFFF</article-title>
          , https://github.com/ClaudeCoulombe/FrenchLeffLemmatizer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yin</surname>
          </string-name>
          , H. Schütze,
          <article-title>MultiGranCNN: An architecture for general matching of text chunks on multiple levels of granularity</article-title>
          , in: C.
          <string-name>
            <surname>Zong</surname>
          </string-name>
          , M. Strube (Eds.),
          <source>Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</source>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Beijing, China,
          <year>2015</year>
          , pp.
          <fpage>63</fpage>
          -
          <lpage>73</lpage>
          . URL: https://aclanthology.org/P15-1007/. doi:10.3115/v1/P15-1007.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fišer</surname>
          </string-name>
          ,
          <article-title>Building a free french wordnet from multilingual resources</article-title>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <year>2019</year>
          . URL: https:// aclanthology.org/ D19-1410.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>sentence-transformers/all-roberta-large-v1</article-title>
          , https://huggingface.co/sentence-transformers/all-roberta-large-v1,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>