<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SARD at LongEval: Longitudinal Evaluation of IR Systems by Using Query Rewriting and Hybrid Queries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Damiano Caon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Dal Maschio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Disarò</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sofia Maule</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the participation of Team SARD in the Conference and Labs of the Evaluation Forum (CLEF) LongEval 2025 shared task, which investigates the longitudinal evaluation of information retrieval systems on evolving web collections. After a careful analysis on the dataset given to train the system, the group decided to first try some common techniques to improve performance and then focus on those that produced better experimental results. In particular, we explore a wide range of indexing and querying configurations by varying analyzers, language detection granularity, and query formulation strategies using the tools provided by the Apache Lucene framework. Additionally, we examine the impact of synonym-based query expansion using external lexical resources and query rewriting to correct typing errors. The evaluation is based on standard retrieval metrics: MAP, nDCG, and interpolated precision at standard recall levels, all computed using trec_eval. A custom MATLAB-based pipeline was developed to automate metric extraction, aggregation, and visualization across monthly snapshots. Our results indicate that document-level language detection and a combination of boolean and phrase queries improve performance in most scenarios. Conversely, synonym-based query expansion often degraded performance. In contrast, LLM-based query rewriting to fix user input errors led to noticeable improvements, demonstrating how even small interventions can enhance retrieval quality. We also observe high variability across queries and reflect on the implications for evaluation reliability. Future work will explore learning-to-rank approaches and the integration of large language models, contingent on the availability of higher-quality computational resources and adequate relevance judgments.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Information Retrieval</kwd>
        <kwd>LongEval</kwd>
        <kwd>CLEF</kwd>
        <kwd>Query Rewriting</kwd>
        <kwd>Boolean Query</kwd>
        <kwd>Phrase Query</kwd>
        <kwd>Query Manipulation</kwd>
        <kwd>trec_eval</kwd>
        <kwd>MAP</kwd>
        <kwd>nDCG</kwd>
        <kwd>Interpolated precision at standard recall</kwd>
        <kwd>Language Detection</kwd>
        <kwd>Lucene</kwd>
        <kwd>Retrieval Performance</kwd>
        <kwd>Analyzer Comparison</kwd>
        <kwd>Longitudinal Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Information Retrieval (IR) systems are often trained on static datasets, but real-world data, especially on
the Web, evolves continuously over time. As a result, their efectiveness tends to degrade when applied
to new, unseen data. This issue is at the core of CLEF 2025 LongEval Task 1 (Web Retrieval)[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which
focuses on evaluating the robustness of IR models over time.
      </p>
      <p>
        This lab provides an opportunity to evaluate retrieval systems on a temporally evolving collection with
shared tasks and relevance judgments [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The goal is to investigate how systems behave in longitudinal
settings, where both the data and the information needs change over time.
      </p>
      <p>The aim of this report is to present our solution to this challenge. Our approach was to design an IR
system capable of maintaining stable performance across temporal shifts. In particular, we focused
on the development of custom language-specific analyzers for the preprocessing of documents and
queries, and also the correction of user queries using GPT-4 Turbo. Unlike traditional query expansion
methods, which add semantically related terms, our use of GPT-4 Turbo aims to rewrite and clarify
ill-formed or noisy queries without altering their intent, resulting in more precise and efective matching
with the indexed content.</p>
      <p>The paper is organized as follows: Section 3 describes our approach; Section 4 explains our experimental
setup; Section 5 discusses our main findings; and Section 6 draws some conclusions and outlines
directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        To reduce the total time required to index all documents, we took inspiration from the work of a previous
team participating in the LongEval challenge, specifically Team GWCA [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. While their implementation
leveraged multithreading for concurrent processing, we extended this idea by parallelizing not only the
indexing tasks but also the document loading phase. In particular, we grouped input files into batches
and assigned them to threads, rather than processing them one at a time.
      </p>
      <p>
        Furthermore, to ground our methodology in prior experience, we consulted the LongEval notebook
papers from the two previous years [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These provided insight into the range of approaches
attempted in earlier editions of the task and highlighted techniques that yielded competitive results.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The methodology employed in this work follows the principles of iterative design. An initial baseline
pipeline was defined at the beginning of the system development. As the work progressed, this structure
was continuously refined through a cycle of implementation, testing, and improvement.
These iterations were not carried out blindly or based solely on trial and error; rather, they were guided
by reasoned decisions aligned with established theoretical findings in the field of Information Retrieval.
This approach enabled the progressive enhancement of the system based on both practical outcomes
and theoretical grounding.</p>
      <p>The architecture of the system, which also served as the foundational pipeline, was composed of
the following key components:
• Parser – responsible for reading and structuring the raw input data;
• Analyzer – performed preprocessing tasks such as tokenization, stop-words removal, and
stemming;
• Indexer – constructed and maintained the index used for information retrieval;
• Searcher – executed queries against the index and returned ranked results.</p>
      <p>In order to manage the execution of the pipeline, a lightweight Main class was developed. This class
was designed to maximize operational flexibility, allowing the user to control the system with the
greatest degree of freedom possible. For instance, it is possible to select and process only a subset of
months by modifying the initial month list, or to choose whether to perform indexing, searching, or
both, by simply adjusting internal configuration flags. Additionally, users can easily set parameters
for the search process, such as the number of documents to retrieve and the type of query to execute
— either boolean or phrase queries. This design choice proved particularly valuable during system
evaluation, as it avoided redundant indexing, a computationally expensive step, and enabled rapid
testing of alternative configurations on previously built indexes.</p>
      <p>The previously listed modules were implemented as standalone packages capable of interacting
seamlessly with one another. Each component was designed with modularity in mind, allowing for flexible
configuration and easy parameter adjustments. This architectural choice facilitated experimentation
with diferent setups during the evaluation phase, supporting the iterative and theoretically-informed
development process.</p>
      <sec id="sec-3-1">
        <title>3.1. Parser</title>
        <p>The Parser is responsible for reading the raw JSON files and transforming them into structured
documents suitable for subsequent analysis and indexing. Although the component is relatively
straightforward, particular care was taken to ensure robustness and eficiency during the design and implementation
phase.</p>
        <p>
          To process potentially very large datasets without running into memory limitations, the Parser was
implemented using a streaming approach. Specifically, it employs a tokenized JSON reader (based
on the Jackson library [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]) that reads the documents sequentially from the file. This design allowed
the system to handle thousands of documents eficiently, without loading the entire dataset into memory.
During the parsing process, each document is cleaned to reduce textual noise and standardize the
content before further processing. The following preprocessing steps are applied to the textual contents:
• Removal of emojis;
• Removal of URLs;
• Removal of HTML tags;
• Conversion to lowercase.
        </p>
        <p>These cleaning operations were motivated by the need to simplify the input text and enhance the
efectiveness of downstream components such as tokenization and indexing. In particular, removing
highly variable or non-textual elements such as emojis, URLs, and HTML markup helped improve the
consistency of the documents without sacrificing meaningful content.</p>
        <p>Additionally, the Parser provides configurable options to selectively enable or disable the removal
of emojis, URLs, and HTML tags. This feature was designed to ofer flexibility during experimentation
phases, even though in the final pipeline all cleaning steps were consistently applied.
To ensure robustness, the Parser incorporates a safety check that automatically skips invalid
documents, defined as records missing either the identifier or the contents field. This prevented pipeline
failures and minimized the impact of noise or inconsistencies in the raw data.</p>
        <p>The documents produced by the Parser are instances of the JsonDocument class, a lightweight data
model encapsulating the id and contents fields of each document. For consistency across modules,
the field names ( id and contents) are defined as constants within the JsonDocument.FIELDS inner
class, ensuring compatibility with components such as the Analyzer and the Lucene-based indexing
system.</p>
        <p>Overall, the Parser contributed to establishing a clean and reliable document stream, serving as a
solid foundation for the subsequent stages of the information retrieval pipeline.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Analyzer</title>
        <p>To accommodate the multilingual nature of the dataset and support rapid experimentation, we developed
a flexible analyzer architecture. The process began with an exploratory component ( TestAnalyzer)
and led to the definition of two final language-specific analyzers tailored to English and French
documents.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. TestAnalyzer</title>
          <p>Initially, due to the presence of documents written in both English and French, we decided to build two
separate analyzers—one for each language—so that the most appropriate processing could be applied
based on the document’s language. To support rapid experimentation with diferent configurations, we
developed the TestAnalyzer class.</p>
          <p>This component is designed to be flexible and fully configurable through external JSON files, which
define all processing parameters without requiring any code modifications.</p>
          <p>Each configuration file specifies the following parameters:
• tokenizerType: the type of tokenizer to use;
• minLength - maxLength: optional token length boundaries used to filter out tokens that are
too short or too long. We tried using the range 0-100 to disable length filtering, and ranges such
as 2-25 or 3-20 to restrict tokens to more meaningful lengths.
• useEnglishPossessiveFilter: whether to apply the English possessive filter;
• stopListFilePath: path to the custom stopword list file;
• stemFilterType: the type of stemming filter to apply;
• language: the language of the input text;
The language parameter also enables language-specific processing rules, such as the handling of
contractions in French or possessive forms in English, and selects the most appropriate stemming
algorithm.</p>
          <p>The TestAnalyzer played a central role during the experimental phase, allowing us to evaluate various
combinations of tokenization, stopword removal, and stemming for both English and French collections.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. FrenchTextAnalyzer and EnglishTextAnalyzer</title>
          <p>We selected the best-performing setup and formalized it into two dedicated analyzers:
FrenchTextAnalyzer and EnglishTextAnalyzer. Each analyzer was designed to suit the linguistic
characteristics of its respective language, while sharing a common processing pipeline structure.
• Tokenization: Both analyzers begin by breaking input text into tokens using Lucene’s</p>
          <p>StandardTokenizer, which handles basic word segmentation.
• Normalization: A series of filters are then applied to standardize token formats:
– LowerCaseFilter converts all characters to lowercase.
– ICUFoldingFilter removes accents and diacritical marks to harmonize character
encodings.
– PatternReplaceFilter eliminates punctuation using a regular expression.
– LengthFilter retains only tokens between 2 and 20 characters, removing overly short or
long terms.
• Stopword Removal: Language-specific stopword lists are applied using StopFilter. We used
two separate lists: french_stopwords.txt and english_stopwords.txt, both taken from
Oracle’s default stoplists1.
• Stemming and Language-Specific Filters:
– The FrenchTextAnalyzer includes an ElisionFilter to remove contractions like “l’”,
“d’”, and “c’”, which are common in French. For stemming, it uses
FrenchMinimalStemFilter, which we found to perform best by balancing morphological
reduction with semantic retention. We also tested FrenchLightStemFilter and the
SnowballFilter with FrenchStemmer, but they were less efective in preserving useful
lexical distinctions.
– The EnglishTextAnalyzer applies an EnglishPossessiveFilter to strip possessive
sufixes (e.g., “’s”), reducing noise from possessive forms. It then uses
EnglishMinimalStemFilter, which outperformed alternatives like
PorterStemFilter and the SnowballFilter with EnglishStemmer by avoiding
overstemming and maintaining semantic clarity.</p>
          <p>Overall, this dual-analyzer setup allowed us to tailor preprocessing to the specific needs of each language,
improving indexing consistency and retrieval efectiveness.
1https://docs.oracle.com/en/database/oracle/oracle-database/21/ccref/oracle-text-supplied-stoplists.html#
GUID-5DAF7499-5EBA-41E6-AF1A-C3BD2C08F88F</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Indexer</title>
        <p>The Indexer is responsible for creating the index used for document retrieval. In the final versions of
our systems, each Lucene Document in the index contains five fields: the document ID, the content (if
in English), the content (if in French), the language of the document, and the name of the JSON file from
the collection where the original document is located. Having two separate content fields allowed us to
apply diferent analyzers to each, efectively handling both languages. Although it is a fairly standard
component, with limited impact on the overall efectiveness of the system, we focused on improving its
performance to facilitate the system testing phase. In particular, we concentrated our eforts on two
key aspects: multithreading and eficient language detection.</p>
        <p>
          To reduce the total time required to index all documents, and inspired by the work of a previous
team participating in the LongEval challenge [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], we designed the Indexer to leverage multiple
processors using a multithreaded approach. This optimization resulted in a 50% reduction in indexing time
compared to the single-threaded implementation.
        </p>
        <p>
          Since the Indexer is also responsible for selecting the appropriate Analyzer based on the document’s
language, we tried to make the language detection process both fast and reliable. Initially, we considered
using the Lingua library [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], a widely adopted Java tool for language detection, but we observed that it
introduced a significant time overhead. Consequently, we implemented a custom language detection
class using a hybrid approach, analyzing only the beginning of each document to minimize processing
time. The first version of this class was developed to recognize the language of a sentence, but then it
was modified to work with full documents. The steps of our language detection process are as follows:
• If the document contains enough words, we use two predefined stopword lists (one for English
and one for French) to estimate the language, counting the occurrences of stopwords from each
list;
• If the resulting confidence score, i.e. the diference in the counts divided by the total number of
stopword occurrences, exceeds a predefined threshold, the detected language is returned;
• If confidence is too low or the text is too short, we fall back to the Lingua library, restricting
detection to English and French;
• If the result is UNKNOWN (e.g., if the text is in a diferent language), we default to French.
After experimenting with various parameter settings, we selected the following configuration:
• A minimum of 10 words to attempt detection using the stopword approach;
• A maximum of 500 words to process per document;
• A confidence threshold of 0.2 for the stopword-based detection.
        </p>
        <p>This configuration was tested on approximately 46000 documents, corresponding to the first three
JSON files from the September 2022 snapshot. The results showed that in roughly 99% of the cases, the
language detected by our hybrid method matched the one identified by Lingua when applied to the full
text. Furthermore, a manual review of the mismatches revealed that in almost 60% of the cases, the
discrepancy was not actually a misclassification, as the texts were multilingual or written in a diferent
language altogether. The manual review also helped us better understand common sources of detection
inconsistency, such as:
• Mixed French and English texts (e.g., doc141);
• Lists from e-commerce sites or product catalogs (e.g., doc1354517);
• Documents written in various languages and sometimes diferent alphabets (e.g., doc258);
• Cookie banners or browser-related boilerplate (e.g., doc3760).</p>
        <p>Despite a slight loss in accuracy compared to the full-text Lingua approach, our hybrid method achieved
a 95.6% speedup in language detection, leading to a significant overall reduction in indexing time. It is
worth mentioning that the language detection class also contains a method to recognize the language of
single words by checking two large vocabularies (one for French and one for English), with a fallback
on Lingua if the term is not found there. However, this method proved unreliable, often failing with
complex or uncommon terms, and was therefore not used in the final systems.</p>
        <p>Finally, by inspecting the collection, we noticed that some documents were repeated multiple times
within the same snapshot (e.g., doc3173607 appears 56 times in the September 2022 snapshot). To
address this, the Indexer was modified to check whether each document had already been processed
before indexing it.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Searcher</title>
        <p>The Searcher component is responsible for handling the search functionality in the information
retrieval system. It is designed to query the index, process the results, and output the relevant documents
based on the provided queries and QRELS. The implementation supports various features and settings
to try diferent approaches and improve or comparing retrieval processes, such as query expansion,
handling multiple languages (English and French), and parallelization to improve (time) performance.
The configurability of the Searcher object has allowed us to easily swap from a setup to the other, in
order to try as many configurations as possible, using diferent types of analyzers with diferent query
types.</p>
        <sec id="sec-3-4-1">
          <title>3.4.1. Query Processing</title>
          <p>
            The search process begins by reading the set of provided topics: a custom QueryReader object is used
by the Searcher class to parse the queries of a certain month and converting them in QualityQuery
objects, for mapping reasons, which can easily be processed by the system. For each of these
objects, a query based on the provided content, either in the form of a Lucene’s BooleanQuery or a
PhraseQuery, depending on the configuration, is created. The BooleanQuery is particularly useful
for basic matching, where the query terms are connected using logical operators, such as OR. On the
other hand, PhraseQuery is employed when the order of the terms in the query is important, allowing
for a better matching of phrases rather than individual terms. The Searcher object then also accepts
two analyzers, both for French and English, which hopefully have to be the same ones used to index
the corpora, to guarantee the best match based on the detected language of query terms. Language
detection, though, proved to be a challenging factor that impacted the search system’s performance.
Diferent systems exhibited inconsistent results depending on the configurations used. Even though we
developed a custom FrenchEnglishDetector, with better performances in terms of time, the Lingua
[
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] LanguageDetector seemed to perform better in average, even if it was slower.
          </p>
        </sec>
        <sec id="sec-3-4-2">
          <title>3.4.2. Parallelization</title>
          <p>To enhance the eficiency of the search process, especially to handle the large indexes and a big number
of queries, the Searcher has been designed to utilize parallel processing. The search tasks for each topic
are executed in parallel using Java’s ExecutorService. This approach enables the system to process
multiple queries simultaneously, reducing the overall search time and improving scalability. The number
of threads used for parallelization is configurable, with a default that matches the number of available
processors on the machine. This ensures optimal resource utilization and allows for fine-tuning based
on system specifications.</p>
          <p>To have and indicative measure of how much the performance improved, the searching was performed
roughly 70% quicker with 4 threads instead of 1.</p>
        </sec>
        <sec id="sec-3-4-3">
          <title>3.4.3. Query Expansion</title>
          <p>Another challenging aspect has been query expansion. The basic process was to generate synonyms
from the query terms and analyze them to create the stems to build the queries to perform on the index.
This way we hoped to increase the potential matches for every query. To better understand the query
expansion methods used, see Section 3.4.5.</p>
          <p>For the BooleanQuery object the process was simply adding synonyms to them (with the OR clause).
However, we immediately noted a drastic fall in performance even when adding just a handful of
synonyms to the queries. We therefore decided to reduce that number as much as possible and to boost,
or more accurately, to penalize the strings not derived directly from the query terms, using Lucene’s
BoostQuery object. It is important to note that even boosting (or not boosting) a single term with a
factor of 1 worsened performance.</p>
          <p>Regarding the PhraseQuery objects, a simple way of enhancing queries has been to use the slop
parameter to generate permutations of the starting query. In addition, to better manage more complex
query expansion with synonyms, we decided to use a Lucene object designed specifically for this
application, the MultiPhraseQuery class.</p>
          <p>Unfortunately, as can be seen in Section 5, these procedures did not help improve performance.</p>
        </sec>
        <sec id="sec-3-4-4">
          <title>3.4.4. GPT elaborated queries</title>
          <p>To improve the quality of the queries, we processed them using GPT-4.0 Turbo asking the Large
Language Model (LLM) to restore proper diacritics, correct misspellings, and rewrite overly long queries.
In a real IR system, this processing would typically be done at runtime each time a new query is
submitted by a user. However, since we needed to handle a large number of queries, we chose to process
them in advance to avoid repeating the same procedure at every run (also considering that the APIs
for the chosen LLM are not free). Since the snapshots from June 2022 to September 2022 contained
numerous queries related to adult content, which caused problems when processed by GPT, we decided
to collect all queries from the remaining training set, remove duplicates, and improve only those. As a
result, queries from October 2022 to February 2023 were fully rewritten, while for the earlier months
the proportion of rewritten queries ranged from 78% to 88%.</p>
        </sec>
        <sec id="sec-3-4-5">
          <title>3.4.5. Synonyms Generator</title>
          <p>To perform query expansion, we developed a class capable of returning a list of synonyms for a given
English or French word. Our original idea was to accomplish this using two diferent techniques: one
based on fixed dictionaries and the other leveraging LLMs.</p>
          <p>
            For the dictionary-based approach, we used WordNet [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] for English queries and WOLF [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], its French
counterpart, for French queries. While we had a CSV version of WordNet readily available, we had to
preprocess the WOLF XML file to extract synonym lists for each word. At runtime, the extracted synonym
sets are parsed into dictionaries, and when the proper methods are called, the SynonymsGenerator
provides a list of synonyms for the given word. This approach is very fast, but the generated synonyms
are not selected with regard to the context of the query, often leading to query drift and poor
efectiveness. Additionally, there were some strange links in the dictionaries, resulting, for example, in
gondola being provided as a synonym for the English term car, or clientélisme for the French
term voiture.
          </p>
          <p>To improve the quality of the synonyms, we first attempted to build a local Python server using Flask
and the model LLaMA-3.2-1B-Instruct, but this approach proved unsuccessful. Despite carefully
designed prompts explicitly instructing the model to return only the expanded query, the output was
often inconsistent or poorly formatted. For instance, when expanding the French query:
évolution voiture
the model returned:
évolution de la voiture, évolution automobile, évolution du véhicule,
évolution des voitures, avancement technique, amélioration continue,
innovation technologique, avancée scientifique, progrès industriel,
développement numérique, amélioration continue de la voiture, avancement
technologique, progrès de l’automobile, évolution de l’industrie
which was satisfactory in terms of format but slightly too broad in terms of relevance. However, for the
English version of the same query:</p>
          <p>car evolution
the model produced the following output:</p>
          <p>Original Query:\n Car evolution refers to the process of changing a
vehicle’s design, features, and technologies over time.\n Synonyms/variants:\n
1. Automotive evolution\n 2. Vehicle development\n 3. Car transformation\n
4. Motor vehicle evolution\n 5. Vehicle modernization
despite the prompt being identical in structure and clearly requesting a clean list of synonyms in plain
text, without any preamble or line breaks. Furthermore, the server required nearly 15 seconds to
process each request, making it impractical for handling the large number of queries provided in the
LongEval challenge. Similar problems were encountered when trying GPT-3.5 Turbo, whereas they
disappeared when testing with GPT-4.0 Turbo. Unfortunately, the cost of using the latter model’s
APIs was prohibitive given the volume of queries to be processed, and the use of LLMs for synonym
generation was ultimately abandoned.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Experimental Strategy</title>
        <p>Our experimental process was structured in multiple stages to explore the impact of diferent indexing
and querying configurations on retrieval performance. The training set results that guided our
methodological choices are presented in Section 5.2.</p>
        <p>Given the multilingual nature of the documents, we initially performed runs using sentence-level
language detection, where each sentence in a document was analyzed individually to assign
languagespecific analyzers. This approach ensured that documents containing both French and English sentences
(often due to browser-related boilerplate) were processed with the appropriate analyzers for each part.
In the first phase, we tested all available analyzer configurations using BooleanQuery in the searcher,
ifxing the maximum number of documents retrieved per topic at 300, to identify the most efective
configuration. The results showed that the best analyzer was minAnaBool1, so all subsequent steps
were carried out using this configuration.</p>
        <p>Next, we aimed to increase indexing speed by switching to document-level language detection, where
the entire document was assigned to either the English or French analyzer based on a global analysis,
instead of detecting the language at the sentence-level (based on a sample of 1000 documents, the
average number of sentences per document was approximately 30). As expected, this change reduced
indexing time and, surprisingly, also improved retrieval performance.</p>
        <p>To enhance precision, we tested the same analyzer using PhraseQuery to prioritize retrieving
documents where query terms appeared close to each other, allowing for a small margin to avoid overly
strict matching. A comparison between the previous BooleanQuery run and the new PhraseQuery
run confirmed the increase in efectiveness. We also increased the maximum number of retrievable
documents to 1000 for both BooleanQuery and PhraseQuery, observing only a negligible gain in
efectiveness but a notable increase in the number of relevant documents retrieved with BooleanQuery
(an average of 32%). As a result, we decided to set the maximum retrievable documents to 1000 for
BooleanQuery, while keeping it at 300 for PhraseQuery, since the increase in search time was not
justified by the marginal gain in recall.</p>
        <p>Recognizing that Qwant users often omit diacritics and that some queries contained errors or were
poorly structured, we applied LLM-based query correction as described in Section 3.4.4. The results
showed that queries re-elaborated by GPT-4.0 Turbo improved performance for BooleanQuery but
slightly decreased it for PhraseQuery.</p>
        <p>To further increase recall, we tested query expansion as described in Section 3.4.5, but this approach
backfired: not only did we retrieve fewer relevant documents, but we also observed a drop in
performance compared to the baseline, particularly for PhraseQuery. This was likely due to the poor
quality of the SynonymsGenerator, which led to query drift. Analyzing the topic-level MAP from the
December 2022 run, we observed that:
• 4227 queries (55%) were identical regardless of the introduction of query expansion;
• 1221 queries (16%) improved, with an average increase of 0.0542;
• 2213 queries (29%) worsened, with an average decrease of 0.0399.</p>
        <p>This indicates that while some generated synonyms were useful, most were not. Note also that we used
the GPT-elaborated queries for PhraseQuery as well, due to the structure of the SynonymsGenerator,
which favors words with proper diacritics.</p>
        <p>Finally, we sought to combine the strong precision of PhraseQuery with the broader recall of
BooleanQuery, performing both approaches together and applying diferent boost values when
combining them. We tested combinations both with and without GPT-elaborated queries, based on what
we had learned about their respective efects. The best results were achieved using boost values of 1.5
for PhraseQuery and 0.9 for BooleanQuery, with GPT-elaborated queries. This combination yielded
the highest MAP, nDCG, and number of relevant documents retrieved.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <sec id="sec-4-1">
        <title>4.1. Collection</title>
        <p>The experiments were conducted using the LongEval-Web Retrieval collection2, a benchmark specifically
designed to study the robustness and stability of web search engines over time. The collection aims to
address two fundamental questions in IR:
• How does a search engine behave as the underlying collection of documents evolves?
• When and how frequently should an IR system be updated to maintain performance as the
document collection changes?
The queries were created in French and extracted from real user logs, while the documents may contain
content in multiple languages, primarily French.</p>
        <p>The dataset is organized into 15 monthly snapshots (June 2022 – August 2023), which enables
finegrained evaluation of IR systems over time. Compared to previous LongEval editions (2023 and 2024),
the 2025 iteration of the collection features an expanded set of snapshots, allowing for more detailed
analysis of collection dynamics.</p>
        <p>The dataset is composed of:
• Training set: nine monthly snapshots from June 2022 to February 2023, containing approximately
18 million documents, more than 13,000 queries, and associated relevance judgments (qrels);
• Test set: six monthly snapshots from March 2023 to August 2023, including documents, queries
and relative qrels;
• Relevance judgments were derived from implicit user feedback, collected automatically via
a click model based on Dynamic Bayesian Networks trained on Qwant search logs. The model
outputs a probability of document attractiveness, which is then mapped to a 3-point relevance
scale (0 = not relevant, 1 = relevant, 2 = highly relevant). This automatic estimation process
is scheduled to be complemented with explicit human relevance assessments after the oficial
submission deadline.</p>
        <p>Overall, the full collection comprises fifteen distinct datasets, each corresponding to a specific monthly
snapshot.</p>
        <p>During the development of this project, we observed a significant performance gap between the
ifrst four months of the training collection and the subsequent five. To better understand the reasons
behind this, and with the hope of finding ways to improve our systems’ scores, we spent considerable
time analyzing the provided documents, queries and qrels. We wrote some Python scripts to speed up
frequent tasks, such as retrieving a document by its ID, and manually examined the files to identify
potential issues. We discovered that, at least for some queries, there is a clear mismatch between
the queries file and the qrels file. For example, query ID 1 for June 2022 refers to an adult content
website, but the documents marked as highly relevant in the qrels were about World War I (which
actually corresponds to query 3). A similar situation occurred with query 27, adenovirus, which was
associated in the qrels with a document about energy eficiency. Moreover, although the LongEval
challenge website states that filters were applied to exclude adult content from the collection, it was
not the case for the first months, possibly explaining the cause of these mismatches. Further analysis
revealed that even the last five months of the training collection were not entirely immune to this
problem, although the impact there was smaller. In any case, this manual review was crucial in helping
us decide which approaches to pursue and which to abandon as using certain methods would have
been misleading without reliable qrels to serve as ground truth.</p>
        <p>To address the issue, the organizers released an updated version of the training queries and qrels on
May 5th, 2025. While the misalignment between queries and qrels was resolved, the qrels still included
lines referring to queries that were not provided or to topics with no associated relevant documents.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Measures</title>
        <p>To assess the efectiveness of our retrieval system, we adopted three widely used evaluation
metrics: Mean Average Precision (MAP), Normalized Discounted Cumulated Gain (nDCG), and interpolated
precision at standard recall levels (iPrec@Recall). These metrics were selected because they capture
complementary aspects of retrieval quality:
• MAP measures the overall precision by averaging the precision scores at the ranks where relevant
documents occur. It reflects how well the system retrieves all relevant documents.
• nDCG evaluates the quality of the ranking by assigning higher importance to relevant documents
that appear earlier in the results list, which is especially important in web search scenarios.
• Interpolated precision at standard recall levels (iPrec@Recall) provides a global view of system
performance by reporting the highest precision achieved for any recall level greater than or equal
to a given threshold. This metric is a well-established tool for analyzing the trade-of between
precision and recall in a smooth and interpretable way.</p>
        <p>
          The evaluation was performed using trec_eval3, a standard tool for information retrieval
benchmarking. We used it to compare our system’s output (run files) against the oficial relevance judgments
(qrels) provided by the LongEval organizers [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. This ensured that our results followed a consistent and
widely accepted evaluation protocol.
3trec_eval repository: https://github.com/usnistgov/trec_eval
4.2.1. MATLAB
To support the analysis of our retrieval experiments, we developed a modular MATLAB framework
designed to transform the output of trec_eval into structured matrices suitable for both manipulation
and visualization of data. The framework handles monthly evaluation files for each run and metric,
converting them into consistent matrices depending on the specific measure.
        </p>
        <p>These matrices are then used in two main ways: to produce clear comparative plots of system
performance over time, and to enable statistical testing via methods such as ANalysis Of VAriance (ANOVA).
This dual purpose made the framework essential to our workflow, allowing us to eficiently organize,
interpret, and compare the results of a wide range of system configurations over various time-snapshots.
The following sections describe each script and its specific role in the evaluation process 4.</p>
        <p>Matrix Construction with Query IDs
This script processes the output files generated by trec_eval and converts them into structured
MATLAB matrices that explicitly preserve query identifiers. It supports three standard evaluation
metrics: MAP, nDCG, and interpolated precision at standard recall levels (iPrec@recall). The user can
choose to elaborate a single metric or all three sequentially.</p>
        <p>For MAP and nDCG, the script constructs a matrix where each row corresponds to a query, the first
column stores the query ID, and the remaining columns contain the scores for each month. This format
enables direct traceability and is suitable for the following analyses that require maintaining the link
between scores and their originating queries. For iPrec@recall, which aggregates scores across queries,
the matrix has fixed recall levels as rows and months as columns.</p>
        <p>The script automatically detects the unique set of query IDs present across all months, aligns scores
accordingly, and fills missing values with NaN to maintain dimensional consistency. The resulting
matrices are saved in dedicated .mat files following a standardized naming convention. This structure
is designed to speed up the process of importing in MATLAB the huge number of trec_eval
measurements for all our runs.</p>
        <p>Cross-System Comparison with Confidence Intervals
This script performs comparative evaluation of multiple retrieval systems over time, using mean scores
and 95% confidence intervals computed via Student’s t-distribution. The processing is based on
prealigned matrices generated by the previously described script.</p>
        <p>The script begins by aligning the set of query IDs across all systems. For each system, it inserts NaN rows
for queries that were retrieved by at least one other system but are missing in the current one. Missing
values are then zero-filled, but only in time snapshots where at least one system reported valid data.
This strategy avoids incorrectly assigning a score of zero to a query that may simply have been absent
from a particular month’s snapshot, such cases are better represented as NaN to preserve correctness in
downstream processing. Finally, rows containing no informative values across any system are removed
as a safeguard, although such cases are not expected in practice. Once alignment is complete, query IDs
are stripped and overall means, per system and metric, are printed (these means are reported in Table 4).
For each metric, the script can generate two types of plots: one showing mean performance over time,
and another using grouped error bars to visualize confidence intervals which are calculated for each
system and time point. Systems are automatically sorted by overall average performance for easier
comparison. A similar visualization is also supported for iPrec@Recall.</p>
        <p>Month-Specific ANOVA and Boxplot Analysis
This script performs a fine-grained statistical analysis of multiple retrieval systems by focusing on
a single time snapshot. It operates on per-query matrices generated for MAP and nDCG, evaluating
4It is important to clarify that this section is meant to illustrate the capabilities of the described scripts. As such, we do not
present every possible plot that could be generated, but rather focus on those that are most informative and relevant to our
analysis.
system performance at a selected month using two-way ANOVA and visual comparison.
The preprocessing and alignment of query scores across systems follow the same procedure described
in the previous script. After alignment, the script isolates the column corresponding to the target month
and constructs an evaluation matrix for each metric (where rows corresponds to queries and columns
to systems), which will then be fed to MATLAB’s anova2 function.</p>
        <p>For both MAP and nDCG, the script runs a two-way ANOVA to assess whether performance diferences
between systems are statistically significant. The results are visualized using MATLAB’s multcompare
tool and sorted boxplots that illustrate distributional diferences in scores across systems.
This script provides a more focused comparison than time-series analysis, enabling robust identification
of statistical significant performance diferences at specific stages of the evaluation timeline.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Repository</title>
        <p>The full source code of the project is available at the following Bitbucket repository:
https://bitbucket.org/upd-dei-stud-prj/seupd2425-sard/src/master/
The repository contains all scripts and resources needed to reproduce the indexing, retrieval, and
evaluation process described in this report.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Hardware</title>
        <p>To perform the experimentations, multiple hardware setups were used. Due to the high computational
cost and long execution time of the indexing process, especially when testing diferent analyzers,
the group decided to distribute the workload among its members. Each member was responsible for
indexing the dataset using a specific analyzer on their personal machine. To maintain consistency,
each member also conducted the corresponding search experiments on the index they had created,
using a searcher configured with the same analyzer setup. This ensured the validity of the results while
balancing both the indexing and searching load across the group. The table below summarizes the
hardware specifications of the systems used:</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Runs Description</title>
        <p>In this section, we provide a brief description of the runs executed using the updated queries and qrels
released after the correction of the originally distributed files 5 produced (runs used just for parameter
tuning and with previous queries and qrels are omitted), along with the corresponding run identifiers
used later in the performance evaluation.</p>
        <p>All runs were executed using the BM25Similarity scoring function, they rely on the French and
English documents collection, with filters and preprocessing steps as described in Section 3.2, and use
5The sufix _newQuery in the run names is used to distinguish them from those based on the old queries and qrels. The
corresponding evaluation scores for the old runs are still available in our project repository for comparison purposes.
stoplists english_stopwords.txt and french_stopwords.txt, unless otherwise specified.
All initial runs employed a BooleanQuery strategy during the search phase, aimed at evaluating the
efectiveness of diferent analyzers:
• stdAnaBool_newQuery: uses Lucene’s StandardAnalyzer for both English and French
collections;
• minAnaBool1_newQuery: uses FrenchMinimalStemFilter and EnglishMinimalStemFilter
for stemming;
• liminAnaBool_newQuery: combines FrenchLightStemFilter stemming for French text
and EnglishMinimalStemFilter stemming for English text;
• snowAnaBool_newQuery: applies SnowballFilter stemming for both languages;
• porterAna1Bool_newQuery: applies FrenchLightStemFilter for French and PorterStemFilter
for English.</p>
        <p>Once the best analyzer configuration was identified as the one used in minAnaBool1_newQuery, we
retained it for all subsequent runs. Additionally, all following runs used the index built with
documentlevel language detection, varying the query types, the number of retrievable documents, and the applied
query manipulations:
• minAnaBool300_newQuery: uses Boolean queries, retrieves at most 300 documents per topic;
• minAnaBool1000_newQuery: uses Boolean queries, retrieves at most 1000 documents per
topic;
• minAnaPhr300_newQuery: uses Phrase queries, retrieves at most 300 documents per topic;
• minAnaBool1000_GPT_newQuery: uses Boolean queries, retrieves at most 1000 documents
per topic, with GPT-processed queries;
• minAnaPhr300_GPT_newQuery: uses Phrase queries, retrieves at most 300 documents per
topic, with GPT-processed queries;
• minAnaBool1000_GPT_QEDict_newQuery: uses Boolean queries, retrieves at most 1000
documents per topic, with GPT-processed queries, dictionary-based query expansion;
• minAnaPhr300_GPT_QEDict_newQuery: uses Phrase queries, retrieves at most 300
documents per topic, with GPT-processed queries, dictionary-based query expansion;
• minAnaMixed_newQuery: uses combination of Boolean (boost 0.9) and Phrase (boost 2.0)
queries, retrieves at most 1000 documents per topic (best model without GPT-processed queries);
• minAnaMixed3_GPT_newQuery: uses combination of Boolean (boost 0.9) and Phrase (boost
1.5) queries, retrieves at most 1000 documents per topic, with GPT-processed queries (best model).</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results on the Training Set</title>
        <p>This section presents a comparative evaluation of system performance on the training set across
configurations and query strategies. The metrics considered are MAP, nDCG, and interpolated precision at
standard recall levels (iPrec@Recall), as introduced in Section 4.2. While all reported metrics were
taken into account in guiding the development process, we present here only the plots of the metrics
that, in each case, best illustrate the results, to improve clarity. All visualizations were generated using
the MATLAB framework described in Section 4.2.1.</p>
        <p>It should be noted that the key decisions were originally based on the old queries and qrels, but were
later confirmed by the results obtained with the updated ones. These are the results presented in this
section. As mentioned in Section 3.5, in the early stage of experimentation, our goal was to determine
the most suitable analyzer to adopt in the subsequent configurations. By comparing five diferent
analyzers using Boolean queries, we observed that all analyzers with diferent stemming behaved
similarly in terms of MAP and nDCG. The only configuration that consistently underperforms across
all metrics is stdAnaBool_newQuery, based on Lucene’s default StandardAnalyzer. Among the
remaining configurations, minAnaBool1_newQuery, based on minimal stemming, shows slightly
higher performance, as can be clearly observed in the interpolated precision plot (Figure 1).
Based on these results, we adopted its configuration as the reference analyzer in all subsequent
experiments.</p>
        <p>After selecting the best analyzer, we proceeded to test additional configurations involving diferent
query strategies, document limits, and query reformulations. The figures below report the comparative
results of these extended experiments for MAP (Figure 2) and nDCG (Figure 3).</p>
        <p>
          The results showed that: increasing the number of retrieved documents was efective for Boolean
queries; query rewriting using GPT-4.0 Turbo proved beneficial for Boolean queries but led to a drop
in scores for Phrase queries; query expansion slightly decreased the performance of Boolean queries
and significantly worsened the results for Phrase queries; the hybrid strategy combining Boolean and
Phrase queries turned out to be the most efective overall. It is also interesting to note that Phrase
queries led to an improvement in MAP, while causing a significant drop in nDCG, indicating that the
system is efective at identifying relevant documents, but less so at ranking them appropriately.
Table 2 shows MAP and nDCG scores for all the systems discussed; the ones submitted to the LongEval
challenge are shown in bold.
To improve the interpretability of our findings, we paired the raw performance results with statistical
testing. Specifically, we performed a two-way ANOVA followed by Tukey’s HSD multiple comparisons
on the results for the 2023-02 snapshot, focusing on nDCG as the main evaluation metric. This particular
snapshot was selected due to its high similarity to those included in the test collection [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The results
of this analysis are reported in Table 3 and Figure 4b, while the corresponding boxplots (Figure 4a)
highlight the distribution and relative stability of each system’s performance. All statistical tests were
conducted at a significance level  = 0.05.
(a) Boxplot of nDCG scores
(b) Multiple comparisons of nDCG scores
        </p>
        <p>The results clearly indicate that the systems are not all statistically equivalent; in particular, the submitted
ones are all significantly diferent from one another. The system based on Phrase queries combined
with query expansion shows a distinctive pattern in the boxplots: due to the presence of many topics
with zero scores, the few with high scores appear as outliers in the overall distribution.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Results on the Test Set</title>
        <p>In this section, we report the results obtained by running the selected configurations on the oficial
LongEval test set. Only the systems marked in bold in Table 2 (training results) were submitted,
and are thus included here. These systems reflect the most efective combinations of query
strategies, analyzers, and reformulation techniques identified during development. We have also included
minAnaBool300_testColl as a baseline to assess how our developments have impacted performance,
as it was the first system built after fixing the indexing strategy.</p>
        <p>The evaluation metrics adopted are consistent with those used for the training set. However, due to the
qrels issues highlighted in Section 4.1, the numerical results cannot be directly compared between the
training and test sets.</p>
        <p>To show the performance trends across the diferent test months, we include two plots: Figure 5 for
MAP and Figure 6 for nDCG, while the corresponding average scores are reported in Table 4. The
results confirm the trends observed on the training set in terms of the relative ranking of systems,
and highlight how the last snapshot, which is the most distant in time from the training set, shows a
degradation in both metrics.</p>
        <p>A more detailed analysis, involving two-way ANOVA and Tukey’s HSD multiple comparisons, was
conducted on a subset of months from the test collection. In particular, we focused on March, April, and
August 2023, which correspond to the first, second, and last snapshots in the collection, respectively.
5.3.1. March 2023 Analysis
The analysis reveals significant diferences in nDCG performance across systems, with systems using
hybrid queries clearly outperforming others. Although by a small margin, the system using Phrase
queries outperforms the baseline and the diference is statistically significant. In this initial month of the
test collection, all systems difer significantly from one another, as shown by the multiple comparisons
(Figure 7b).
(a) Boxplot of nDCG scores
(b) Multiple comparisons of nDCG scores
5.3.2. April 2023 Analysis
The second month of the test collection shows results largely consistent with the first, with the same
relative ranking among systems. This confirms the trend already observed in the training set and
reinforced by the March test results. The only notable diference compared to the previous month is
that the system using Phrase queries is no longer statistically diferent from the baseline.</p>
        <p>Source
Columns
Rows
Error
Total</p>
        <sec id="sec-5-3-1">
          <title>5.3.3. August 2023 Analysis</title>
          <p>The final month of the test collection confirms the same relative ranking among systems and the
significant performance gap between hybrid-query based systems and the others, highlighting the
robustness of these approaches over time. As in April, the system based on Phrase queries is statistically
indistinguishable from the baseline. It is also worth noting that all systems report lower nDCG scores
in this month compared to the previous two analyzed, suggesting increased dificulty when the test set
diverges more substantially from the training months on which systems parameters and configurations
were tuned.</p>
          <p>Overall, across the three snapshots of the test set (March, April, and August 2023), the results confirm
the superior performance and robustness of hybrid-query systems combining Boolean and Phrase
queries. Although all systems experienced some degradation in August, the snapshot most temporally
distant from the training data, the relative ranking among systems remained consistent. Notably,
the system using Phrase queries alone showed a marked drop in nDCG in August, with half of the
topics scoring near zero, suggesting a particular sensitivity to temporal drift in topic-document relevance.
Another important observation concerns the system’s performance with Phrase queries, as illustrated
by the boxplot in Figure 9a. The position of the red line indicates that for half of the topics, the nDCG
score is nearly zero, an outcome that markedly difers from the results observed in March and April.</p>
          <p>Source
Columns
Rows
Error
Total</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>Throughout the project, we observed that even light query processing using GPT to correct errors
in queries led to significant improvements when used on Boolean models. We found that combining
Boolean queries with Phrase queries outperformed each method individually, confirming the added value
of this hybrid approach. As we progressed through the various steps of development, we consistently
saw improvements in retrieval performance.</p>
      <p>However, query expansion, especially when using dictionaries, did not yield the expected benefits. We
concluded that this approach was too generic, particularly for French, where available resources, such
as WordNet for English, are not as rich. The limitations of French language tools ultimately led to
suboptimal results in the query expansion process.</p>
      <p>Overall, the project demonstrated that even small enhancements in query handling can lead to substantial
gains in retrieval performance, but the choice of tools and approaches, especially for less-resourced
languages like French, remains crucial for achieving optimal results.</p>
      <p>Future work will focus on incorporating more advanced retrieval techniques that could not be explored
in the current project due to hardware constraints and limitations in the available data. In particular, the
use of LLMs for tasks such as query rewriting, expansion, and contextual understanding represents a
promising direction. These models can enhance retrieval efectiveness by generating more semantically
rich and intent-aware queries.</p>
      <p>Another avenue we intend to explore is the application of supervised Learning to Rank methods
such as LambdaMART, which have shown strong performance in retrieval competitions and industry
applications. However, the efectiveness of such models strongly depends on the availability of reliable
and fine-grained relevance judgments.</p>
      <p>Overall, advancing toward these techniques will require both a more robust computational infrastructure
and improvements in data quality. Nonetheless, their integration could significantly push the boundaries
of performance within the LongEval framework and similar longitudinal retrieval settings.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 via ChatGPT for two main purposes: (i) to
assist in rewriting user queries in natural language during system development and evaluation; and
(ii) to improve the grammatical accuracy and fluency of some textual sections of this manuscript. All
content generated by the AI tool was thoroughly reviewed, edited, and validated by the authors, who
accept full responsibility for the final submitted version, in compliance with the CEUR-WS policy on
the use of generative AI tools. (https://ceur-ws.org/GenAI/Policy.html).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>[1] CLEF LongEval, CLEF LongEval Tasks Overview</article-title>
          , https://clef-longeval.github.io/tasks/,
          <year>2025</year>
          . Accessed:
          <fpage>2025</fpage>
          -04-19.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cancellieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Ebshihy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iommi</surname>
          </string-name>
          , J. Keller, P. Knoth,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pride</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schaer</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2025 LongEval Lab on Longitudinal Evaluation of Model Performance</article-title>
          , in: J.
          <string-name>
            <surname>Carrillo-de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bellin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Carè</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Pepaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Salvalaio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Segala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tognon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          , SEUPD@CLEF:
          <article-title>Team GWCA on Longitudinal Evaluation of IR Systems by Using Query Expansion and Learning To Rank</article-title>
          , in: Conference and
          <article-title>Labs of the Evaluation Forum https://ceur-ws</article-title>
          .
          <source>org/</source>
          Vol-
          <volume>3497</volume>
          /paper-186.pdf,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alkhalifa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Borkakoty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deveaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Ebshihy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iommi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liakata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Madabushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Medina-Alias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <article-title>Extended Overview of the CLEF 2024 LongEval Lab on Longitudinal Evaluation of Model Performance , in: Conference and Labs of the Evaluation Forum https: //ceur-ws</article-title>
          .
          <source>org/</source>
          Vol-
          <volume>3740</volume>
          /paper-213.pdf,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alkhalifa</surname>
          </string-name>
          , I. Bilal,
          <string-name>
            <given-names>H.</given-names>
            <surname>Borkakoty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deveaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Ebshihy</surname>
          </string-name>
          , L. EspinosaAnke,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , E. Kochkina,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liakata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Loureiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Servan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Madabushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <article-title>Extended Overview of the CLEF-2023 LongEval Lab on Longitudinal Evaluation of Model Performance , in: Conference and Labs of the Evaluation Forum https://ceur-ws</article-title>
          .
          <source>org/</source>
          Vol-
          <volume>3497</volume>
          /paper-184.pdf,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Oracle</given-names>
            <surname>Java</surname>
          </string-name>
          <string-name>
            <surname>Magazine</surname>
          </string-name>
          ,
          <source>Java JSON Serialization with Jackson</source>
          , https://blogs.oracle.com/ javamagazine/post/java-json
          <article-title>-serialization-</article-title>
          <string-name>
            <surname>jackson</surname>
          </string-name>
          ,
          <year>2022</year>
          . Accessed:
          <fpage>2025</fpage>
          -04-19.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>[7] PeMISTAHL, Lingua: A Language Detection Library for Java and Kotlin</article-title>
          , https://github.com/ pemistahl/lingua,
          <year>2022</year>
          . Accessed:
          <fpage>2025</fpage>
          -04-19.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Priceton</given-names>
            <surname>University</surname>
          </string-name>
          , About WordNet, https://wordnet.princeton.edu/,
          <year>2010</year>
          . Accessed:
          <fpage>2025</fpage>
          -04-26.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fišer</surname>
          </string-name>
          ,
          <article-title>Building a free french wordnet from multilingual resources</article-title>
          ,
          <source>in: Ontolex</source>
          <year>2008</year>
          , Marrakech, Morocco,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <article-title>CLEF LongEval, CLEF LongEval Data Overview</article-title>
          , https://clef-longeval.github.io/data/,
          <year>2025</year>
          . Accessed:
          <fpage>2025</fpage>
          -05-04.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>