<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Team RAND at LongEval: Composable Information Retrieval with Semantic and Language-Aware Components</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgia Amato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Brentel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Demo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Steven Laghetto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Pivotto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iegor Toporov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents the work done by team RAND (University of Padua) on the LongEval Web Retrieval challenge, which investigates the robustness and stability of Web search engines in the context of evolving document collections. Our team began by analyzing the dataset and applying standard IR techniques to establish baseline performance. We then iteratively refined our approach by focusing on methods that demonstrated improved effectiveness in handling temporal changes across snapshots.</p>
      </abstract>
      <kwd-group>
<kwd>LongEval 2025</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Query Expansion</kwd>
        <kwd>Filters</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Section 2 describes our methodology; Section 3 presents the experimental setup; Section 4 discusses our main findings;
Section 5 draws some conclusions and outlines future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>We developed this information retrieval system by applying the theoretical foundations covered in
the course alongside the practical implementations introduced during the tutoring sessions. The
development began with the creation of the RANDDocument class and the RANDUniversalParser,
which together enable the conversion of document collections in different formats (.trec and .json)
into a unified RANDDocument object. Subsequently, we implemented the RANDAnalyzer class, capable
of performing tokenization, stopword removal, and stemming operations, configurable for multiple
languages. We would like to highlight that the VocabularyStatistics class and its implementation were
inspired by the project created by a group from the same course two years ago, specifically Team
GWCA [1] (repository: https://bitbucket.org/upd-dei-stud-prj/seupd2223-gwca/src/master/). The next step involved designing the search component, which executes topic-based
searches over the indexed collections using a query file and produces output in standard TREC format.
The final stage of the pipeline is evaluation, carried out through the Evaluator class. This module
leverages the trec_eval tool to compute standard information retrieval metrics such as the number
of retrieved and relevant documents, DCG, and more. Finally, all components are integrated through
the Main class, which orchestrates the execution pipeline with the support of the FileCollector
utility—allowing for recursive file retrieval from nested directories.</p>
      <sec id="sec-2-1">
        <title>2.1. Workflow and Foundations</title>
        <sec id="sec-2-1-1">
          <title>2.1.1. Base System</title>
          <p>The base information retrieval pipeline is structured into three primary stages: parsing, indexing, and
searching. During the parsing phase, the system supports input in both JSON and TREC formats. Text
content is preprocessed to remove unwanted elements such as HTML tags, hyperlinks, special characters,
and other noise, ensuring a clean and consistent representation of the data. In the indexing phase, the
cleaned documents are transformed into a searchable format using the Lucene indexing framework.
This stage supports extensive customization of text processing operations, including tokenization,
normalization, and similarity modeling, allowing flexibility in how documents are represented and
ranked. The searching phase involves executing queries over the constructed index. The system
loads a predefined set of queries and processes them using the same text analysis pipeline employed
during indexing. Advanced techniques such as tokenization and synonym-based query expansion are
supported to improve recall and relevance. Additionally, semantic enhancement tools can be integrated
into the process to further refine the search results. Finally, the retrieved and ranked documents are
exported in a standardized TREC format, ensuring compatibility with established evaluation tools.</p>
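          <p>As a rough illustration of how these stages fit together, the following sketch wires parsing, indexing, and searching in sequence. The class names follow those used throughout Section 2, but every constructor and method signature here is a hypothetical placeholder, not the actual project API.</p>
          <preformat>// Hypothetical sketch of the three-stage pipeline (parse -&gt; index -&gt; search).
// Constructor and method signatures are assumptions for illustration only.
public final class PipelineSketch {
    public static void main(String[] args) throws Exception {
        RANDUniversalParser parser = new RANDUniversalParser("collection/"); // JSON or TREC input
        RANDAnalyzer analyzer = new RANDAnalyzer("analyzer-config.xml");     // language-aware analysis
        RANDIndexer indexer = new RANDIndexer(analyzer, "index/");           // Lucene index on disk
        indexer.index(parser);                                               // parse, clean, and index
        RANDSearcher searcher = new RANDSearcher(analyzer, "index/", "queries.tsv");
        searcher.search("run.txt");                                          // TREC-format run file
    }
}</preformat>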
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. Re-ranking</title>
          <p>To enhance the precision of retrieval and promote more relevant documents to higher ranks, a re-ranking
mechanism has been implemented within the RANDSearcher class. This step is optional and is triggered
only when the top-scoring document from the initial Lucene-based search does not meet a configurable
similarity threshold. The reranking phase functions as a post-retrieval adjustment, reevaluating the
list of retrieved documents using semantic similarity techniques. After Lucene retrieves the initial list
of documents (as ScoreDoc[] objects), the system computes semantic relevance between the query
and the document bodies using one of two models: a Deep Learning (DL) model or a Large Language
Model (LLM). The choice between these is defined in the system’s SearcherParams configuration.
The reranking process unfolds as follows:
</p>
          <p>1. Initial Retrieval: A Lucene query is generated and executed using BM25 scoring. If synonym
expansion is enabled, the query terms are augmented using data from the WOLF lexical database
for French, and corresponding SynonymQuery terms are added.
2. Reranking Trigger: If the highest initial score is below a set threshold, re-ranking is activated.</p>
<p>The threshold is used for time-saving purposes. This addition is somewhat counterintuitive with respect
to the standard re-ranking process; the concept is the following:
to optimize processing time in our search engine pipeline, we introduce a threshold-based
mechanism in the re-ranking phase. Specifically, if a document receives a relevance score below
6 during the initial ranking stage, it is passed through the re-ranking process. Conversely,
documents with a score equal to or above 6 are considered sufficiently relevant and are excluded
from re-ranking. This strategy is based on the observation that high-scoring documents in the
initial stage typically already align well with the user’s query, making additional re-ranking
unnecessary. By applying this threshold, we reduce computational overhead in terms of time,
while maintaining high retrieval effectiveness.</p>
<p>Each document in the top-k list is then semantically compared to the query.
3. Semantic Scoring:
• In the DL configuration, a pre-trained sentence embedding model (e.g., Sentence
Transformers) is used to compute vector representations and similarity scores.
• In the LLM configuration, a large language model (such as GPT-like APIs) is employed to
assess the query-document semantic match.
4. Score Adjustment: The semantic similarity score is scaled and added to the original Lucene
score. This score fusion ensures that both lexical and semantic signals contribute to the final
ranking. The following formula is applied:</p>
<p>sd[i].score += (float)(5 * semanticScore[i]);</p>
          <p>5. Output Generation: The re-ranked list is written to a TREC-compatible run file. Duplicates are
avoided to maintain evaluation consistency.</p>
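          <p>A compact sketch of the trigger-and-fusion logic from steps 2 and 4 is given below. The threshold value (6) and the scaling factor (5) come from the text; the semanticScore() and bodyOf() helpers are hypothetical placeholders for the DL/LLM scorer and the stored document body, respectively.</p>
          <preformat>import java.util.Arrays;
import org.apache.lucene.search.ScoreDoc;

// Sketch of the threshold-triggered re-ranking with score fusion (steps 2 and 4).
// semanticScore() and bodyOf() are hypothetical helpers, not the project API.
static final float RERANK_THRESHOLD = 6.0f;

void rerank(ScoreDoc[] sd, String query) {
    if (sd.length == 0 || sd[0].score &gt;= RERANK_THRESHOLD) {
        return; // top document already scores high enough: skip re-ranking
    }
    for (int i = 0; i &lt; sd.length; i++) {
        double semantic = semanticScore(query, bodyOf(sd[i])); // semantic match in [0, 1]
        sd[i].score += (float) (5 * semantic);                 // fuse lexical and semantic signals
    }
    Arrays.sort(sd, (a, b) -&gt; Float.compare(b.score, a.score)); // re-order by fused score
}</preformat>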
<p>This reranking module is particularly effective in addressing vocabulary mismatches and handling
queries that require deeper semantic understanding. It is fully configurable and can be toggled between
DL, LLM, or disabled modes via system parameters. Empirical results confirm that semantic reranking
significantly improves early precision metrics (e.g., nDCG@10), without compromising the stability of
the overall ranking.</p>
<p>Deep Learning Models For the deep learning reranking phase, we evaluated two pre-trained
sentence embedding models: multi-qa-MiniLM-L6-cos-v1 and multi-qa-L6-cos-v1 [2], both
optimized for multilingual semantic similarity tasks. While the multi-qa-L6-cos-v1 model offers
richer representations, its high computational cost made it impractical on large document sets. We
therefore selected the lighter multi-qa-MiniLM-L6-cos-v1, which balances speed and semantic
accuracy effectively. To keep runtimes reasonable, we limited reranking to the top 100 Lucene-retrieved
documents. We observed a clear tradeoff: retrieving fewer than 100 documents yielded poor precision
metrics (e.g., nDCG@10, MAP), whereas increasing the number of retrieved documents improved these
metrics but caused runtime to grow steeply with the number of query-document similarity
computations.</p>
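          <p>The per-document semantic score itself reduces to a cosine between sentence embeddings. The sketch below assumes an embed() placeholder standing in for the multi-qa-MiniLM-L6-cos-v1 encoder, which is not shown here.</p>
          <preformat>// Cosine similarity between query and document embeddings; embed() is a
// placeholder for the sentence-embedding model (an assumption, not shown).
static double cosine(float[] a, float[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i &lt; a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12); // guard against zero vectors
}

double semanticScore(String query, String body) {
    return cosine(embed(query), embed(body)); // both vectors from the same model
}</preformat>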
          <p>Large Language Model As an alternative, we explored reranking using the
unsloth/Llama-3.2-1B-Instruct model [3], a Large Language Model designed for
instruction-following tasks. In this setup, the system constructs a custom prompt that includes the query and
candidate documents, and the LLM returns a relevance score for each pair. While the LLM provided
advanced semantic reasoning, we ultimately excluded it from the final system due to its excessive
runtime and high computational demands, especially when processing large document batches.</p>
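          <p>For completeness, the LLM-based scoring can be sketched as follows; the prompt wording and the llm.generate() client are illustrative assumptions, not the actual prompt or API used in the project.</p>
          <preformat>// Hypothetical sketch of LLM-based relevance scoring with
// unsloth/Llama-3.2-1B-Instruct; llm.generate() is a placeholder client call.
double llmScore(String query, String document) {
    String prompt = "Rate the relevance of the document to the query on a scale from 0 to 1.\n"
            + "Query: " + query + "\n"
            + "Document: " + document + "\n"
            + "Answer with a single number.";
    String answer = llm.generate(prompt);     // placeholder client call
    return Double.parseDouble(answer.trim()); // parse the returned score
}</preformat>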
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Implementation</title>
        <p>This section provides an overview of the key classes developed to construct the information retrieval
system. The final execution and orchestration of all components are managed by the Main class.</p>
        <sec id="sec-2-2-1">
          <title>2.2.1. RANDDocument</title>
          <p>The RANDDocument class serves as an abstraction layer for indexed and searchable documents. It is
designed to handle both TREC and JSON input formats, providing a unified structure for the indexing
pipeline. Upon ingestion, this class parses the raw data and converts it into a format compatible with
the Lucene indexing engine. Specifically, it defines two main fields:
• IdField: A unique identifier for each document. This field is stored but not tokenized, ensuring
fast and precise retrieval.
• BodyField: Contains the main textual content of the document. It is tokenized for full-text search
and also explicitly stored to support potential re-ranking operations based on semantic similarity.
By supporting both TREC and JSON formats, the RANDDocument class allows the system to seamlessly
process heterogeneous datasets while maintaining efficient indexing and retrieval.</p>
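          <p>In Lucene terms, the two fields can be sketched as below, assuming a recent Lucene version; the field names "id" and "body" are illustrative stand-ins for IdField and BodyField.</p>
          <preformat>import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// Sketch of the two fields described above.
Document toLuceneDocument(String id, String body) {
    Document doc = new Document();
    doc.add(new StringField("id", id, Field.Store.YES));   // IdField: stored, not tokenized
    doc.add(new TextField("body", body, Field.Store.YES)); // BodyField: tokenized and stored for re-ranking
    return doc;
}</preformat>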
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. RANDParser</title>
<p>Given the need to manage different input formats (i.e., JSON and TREC), we developed a flexible parser
named RANDUniversalParser.java. This class dynamically selects the appropriate parser based on
the input file’s extension, as sketched below:
• If the file is in JSON format, it is handled by RANDJsonParser.java.
• If the file is in TREC format, it is delegated to RANDTrecParser.java.
• Files with unsupported extensions trigger an error message.</p>
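          <p>A minimal sketch of this dispatch logic follows; the constructor signatures and the DocumentParser supertype are assumptions.</p>
          <preformat>import java.nio.file.Path;

// Sketch of the extension-based dispatch performed by RANDUniversalParser.
static DocumentParser forFile(Path file) {
    String name = file.getFileName().toString().toLowerCase();
    if (name.endsWith(".json")) {
        return new RANDJsonParser(file); // Jackson-based JSON parsing
    } else if (name.endsWith(".trec")) {
        return new RANDTrecParser(file); // lightweight TREC parsing
    }
    throw new IllegalArgumentException("Unsupported file extension: " + name);
}</preformat>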
          <p>RANDJsonParser.java is based on the WapoParser.java presented during the tutoring sessions.
It uses the Jackson library to process JSON content. The core method, next(), iterates through the
collection and cleans each document by removing elements such as emojis, URLs, and HTML tags
before returning it.</p>
          <p>RANDTrecParser.java, derived from the TipsterParser.java example discussed in class, is a
lightweight parser for TREC-formatted files. It identifies documents enclosed within &lt;DOC&gt; and &lt;/DOC&gt;
tags, extracts the document ID and content, and applies similar cleaning steps via the cleanContent()
method.</p>
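          <p>The exact regular expressions of cleanContent() are not reproduced here; the patterns below are illustrative assumptions of the kind of cleaning described (HTML tags, URLs, emojis).</p>
          <preformat>// Illustrative sketch of cleanContent(); the actual patterns may differ.
static String cleanContent(String raw) {
    return raw
        .replaceAll("&lt;[^&gt;]+&gt;", " ")         // drop HTML tags
        .replaceAll("https?://\\S+", " ")    // drop URLs
        .replaceAll("[\\p{So}\\p{Cn}]", " ") // drop emojis and other symbols
        .replaceAll("\\s+", " ")             // collapse whitespace
        .trim();
}</preformat>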
        </sec>
        <sec id="sec-2-2-3">
          <title>2.2.3. RANDAnalyzer</title>
          <p>The RANDAnalyzer class defines the text analysis pipeline used in our Information Retrieval system. It
was designed to offer maximum flexibility during experimentation, making it a fully hybrid component
capable of processing documents in both English and French. Thanks to the use of an external XML
configuration file, all analysis parameters can be flexibly defined without modifying the codebase.
This setup enables the selection of diferent tokenizers and filters depending on the chosen language
configuration. The system supports three tokenizer types, configurable via the TokenizerType
parameter:
• WhitespaceTokenizer
• LetterTokenizer
• StandardTokenizer
After tokenization, terms are passed through a series of filters:
• LowerCaseFilter: converts terms to lowercase,
• LengthFilter: removes terms based on length,
• StopFilter: eliminates stopwords.</p>
          <p>Two main analysis pipelines were developed: one for English and one for French.</p>
          <p>English configuration:
• EnglishPossessiveFilter: removes possessive suffixes.
• Followed by a selectable stemmer:
– EnglishMinimalStemFilter
– PorterStemFilter
– KStemFilter
– SnowballFilter
French configuration:
• FrenchLightStemFilter
• ICUFoldingFilter
• ElisionFilter (initially included but later excluded due to negative performance impact)
After extensive testing, the French pipeline—excluding the ElisionFilter—demonstrated the best
retrieval performance, especially given that most relevant documents in the dataset were in French.
These findings are further illustrated in the results shown in Section 4.2.</p>
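          <p>As a concrete reference, the best-performing French pipeline can be assembled as a custom Lucene Analyzer roughly as follows. This is a minimal sketch assuming a recent Lucene version (ICUFoldingFilter requires the lucene-analysis-icu module); the filter ordering and length bounds are illustrative assumptions.</p>
          <preformat>import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.fr.FrenchLightStemFilter;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch of the French pipeline without the ElisionFilter.
public class FrenchPipelineSketch extends Analyzer {
    private final CharArraySet stopwords; // e.g., loaded from oracleFrench.txt

    public FrenchPipelineSketch(CharArraySet stopwords) {
        this.stopwords = stopwords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source); // lowercase terms
        stream = new LengthFilter(stream, 2, 30);         // illustrative length bounds
        stream = new StopFilter(stream, stopwords);       // remove stopwords
        stream = new ICUFoldingFilter(stream);            // fold accents and diacritics
        stream = new FrenchLightStemFilter(stream);       // light French stemming
        return new TokenStreamComponents(source, stream);
    }
}</preformat>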
        </sec>
        <sec id="sec-2-2-4">
          <title>2.2.4. VocabularyStatistics</title>
          <p>To verify index integrity and inspect the terms it contains, we initially considered using Luke (Lucene
Index Toolbox). However, due to its limited usability for our specific needs, we explored alternative tools.
Through prior projects, we identified a custom Java utility called VocabularyStatistics, developed
by a previous student group (GWCA [1]), which proved well-suited to our requirements. This utility
generates a text file listing all terms indexed by Lucene, along with their frequency statistics. The
main output file, vocabulary.txt, contains all terms sorted in descending order by raw frequency.
Additionally, the tool can automatically append the most frequent terms to the stoplist (removing any
duplicates). However, our testing showed that the manually curated oracleFrench.txt stoplist was
already highly optimized. As a result, we chose to retain the original file and disable the automatic
expansion feature, which had been shown to degrade performance.</p>
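          <p>The core of such a utility is a straightforward enumeration of the indexed terms. A minimal sketch, assuming a recent Lucene version and an illustrative "body" field name:</p>
          <preformat>import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import java.nio.file.Paths;

// Print every indexed term of the "body" field with its document and
// collection frequencies, in the spirit of VocabularyStatistics.
public class VocabularySketch {
    public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
            Terms terms = MultiTerms.getTerms(reader, "body");
            TermsEnum it = terms.iterator();
            for (BytesRef term = it.next(); term != null; term = it.next()) {
                System.out.printf("%s\t%d\t%d%n", term.utf8ToString(), it.docFreq(), it.totalTermFreq());
            }
        }
    }
}</preformat>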
        </sec>
        <sec id="sec-2-2-5">
          <title>2.2.5. RANDIndexer</title>
          <p>The RANDIndexer class is responsible for constructing a Lucene index from a collection of parsed
documents. Its primary functions include:
• Creating the Lucene index based on RANDDocument objects produced by the parser.
• Avoiding duplicate entries and filtering out invalid documents.
• Enabling configuration of analyzers and similarity functions (e.g., BM25) to tailor the indexing
behavior.</p>
          <p>The indexer is composed of the following core components:
</p>
          <p>Constructor Parameters:
• An Analyzer for tokenizing and processing document content.
• A Similarity model for scoring (e.g., BM25).
• An integer value defining the amount of RAM allocated for document buffering before flushing
to disk.
• The path specifying where the index is stored.</p>
          <p>• A Parser instance responsible for reading and transforming the source documents.
index() Method:
• Validates input documents, skipping null entries or those without a valid ID.
• Converts each RANDDocument into a Lucene Document using the parser’s
toLuceneDocument() method.
• Adds the documents to the index and finalizes the process with a commit and close operation.
• Outputs a report detailing the number of indexed, skipped, and duplicated documents.</p>
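          <p>The indexing loop can be sketched as follows, assuming a recent Lucene version; getId(), the iteration over the parser, and the variable names are illustrative assumptions.</p>
          <preformat>import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

// Sketch of the index() method wiring the constructor parameters listed above.
IndexWriterConfig cfg = new IndexWriterConfig(analyzer); // the configured RANDAnalyzer
cfg.setSimilarity(new BM25Similarity());                 // similarity model (e.g., BM25)
cfg.setRAMBufferSizeMB(ramBufferSizeMB);                 // RAM for buffering before flushing

try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(indexPath)), cfg)) {
    for (RANDDocument d : parser) {
        if (d == null || d.getId() == null) {
            continue;                             // skip null or ID-less documents
        }
        writer.addDocument(d.toLuceneDocument()); // convert and add to the index
    }
    writer.commit();                              // finalize; close() via try-with-resources
}</preformat>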
        </sec>
        <sec id="sec-2-2-6">
          <title>2.2.6. RANDSearcher</title>
          <p>The RANDSearcher class implements the search functionality of our system. It leverages Lucene’s
indexing infrastructure to retrieve documents relevant to a given set of queries in TREC format.
Queries are read from a tab-separated file where each line contains a query ID and its corresponding
description. These are analyzed using a RANDAnalyzer, ensuring consistency with the indexing phase.
A BooleanQuery is then constructed, targeting the Body field of the indexed documents. Each query
term is converted into a Lucene TermQuery. If query expansion is enabled, the WolfManager class
provides synonym terms using the WOLF (Wordnet Libre du Français) resource. These synonyms
are incorporated into the search via SynonymQuery objects with weighted contributions.</p>
          <p>The search process proceeds as follows:
1. Initial Retrieval: Lucene search using BM25 (with optional synonym expansion).
2. Optional Re-ranking: Semantic similarity via DL or LLM models.
3. Score Fusion: The system combines lexical and semantic scores, as previously described in the
Re-ranking section.
4. Output: Results saved in TREC format, ensuring no duplicates.</p>
<p>This entire search and re-ranking pipeline is controlled by an XML configuration (SearcherParams).
It supports customization of analyzers, similarity models, synonym expansion, and re-ranking methods.
The getReRanker() parameter can be set to DL, LLM, or None, enabling fine-grained control over the
system’s semantic capabilities.</p>
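          <p>The query construction can be sketched as follows, assuming a recent Lucene version (SynonymQuery.Builder supports per-term boosts since Lucene 8.4); wolf.synonymsOf() is a placeholder for the WolfManager lookup, and the "body" field name is illustrative.</p>
          <preformat>import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SynonymQuery;
import org.apache.lucene.search.TermQuery;

// Sketch: one TermQuery clause per analyzed query term, plus a weighted
// SynonymQuery per term when expansion is enabled.
Query buildQuery(List&lt;String&gt; terms, boolean expand, float synonymWeight) {
    BooleanQuery.Builder bq = new BooleanQuery.Builder();
    for (String t : terms) {
        bq.add(new TermQuery(new Term("body", t)), BooleanClause.Occur.SHOULD);
        if (expand) {
            SynonymQuery.Builder syn = new SynonymQuery.Builder("body");
            for (String s : wolf.synonymsOf(t)) {
                syn.addTerm(new Term("body", s), synonymWeight); // weighted contribution (e.g., 0.5)
            }
            bq.add(syn.build(), BooleanClause.Occur.SHOULD);
        }
    }
    return bq.build();
}</preformat>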
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <sec id="sec-3-1">
        <title>3.1. Collection</title>
        <p>For our project, we utilized the LongEval 2025 Web Retrieval Train Collection, curated by TU
Wien in collaboration with Qwant.</p>
        <p>This dataset is designed to support research in temporal information retrieval and the evaluation of
long-term relevance persistence.</p>
        <p>Dataset Composition:
• Queries: 9,000 user-issued queries focused on trending topics, extracted from Qwant’s search
logs between June 2022 and February 2023.
• Documents: Approximately 18 million web documents retrieved in response to those queries,
including both clicked results and randomly sampled content from Qwant’s index.
• Relevance Judgments: Graded relevance labels derived from a click model, reflecting user
interaction and engagement patterns.</p>
        <p>Data Characteristics:
• The dataset includes original French versions of both queries and documents, adding a multilingual
dimension to retrieval tasks.
• Documents are stored in JSON format, with metadata such as document ID, URL, and timestamp,
enabling structured parsing and temporal analysis.
• The temporal coverage of the data captures language evolution and topical drift, challenging
systems to maintain performance over time.</p>
<p>This collection is used as the official training set for the LongEval 2025 Information Retrieval Lab at
CLEF and provides a robust benchmark for evaluating retrieval systems in dynamic and realistic web
search scenarios.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluation Measures</title>
        <p>We evaluate the performance of our information retrieval system using three key metrics: Mean
Average Precision (MAP), Normalized Discounted Cumulative Gain (nDCG), and Normalized
Discounted Cumulative Gain at rank 10 (nDCG@10).</p>
<p>• MAP (Mean Average Precision): MAP provides an overall measure of the system’s ability to
retrieve relevant documents.</p>
        <p>It calculates the average precision for each query and then averages these values across all queries.
This metric is useful for assessing how well the system ranks relevant documents throughout the
entire result list.
• nDCG (Normalized Discounted Cumulative Gain): nDCG evaluates the ranking quality of
the top documents by giving more weight to relevant documents at the top of the result list.
It is particularly valuable for assessing systems where the relevance of documents in higher ranks
is more significant.
• nDCG@10 (Normalized Discounted Cumulative Gain at rank 10): This variation of nDCG
emphasizes the relevance of the top 10 documents.</p>
        <p>It applies a logarithmic discount to the relevance of documents at lower ranks, giving more
importance to those ranked near the top.</p>
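        <p>In one common formulation, with graded relevance rel_i at rank i, P(k) the precision at cut-off k, and |R_q| the number of relevant documents for query q, these metrics read as follows (LaTeX notation):</p>
        <preformat>\mathrm{AP}(q) = \frac{1}{|R_q|} \sum_{k=1}^{n} P(k)\,\mathrm{rel}(k),
\qquad
\mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q)

\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i+1)},
\qquad
\mathrm{nDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}</preformat>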
<p>We used the official trec_eval tool to calculate both MAP and nDCG metrics based on our system
outputs and the provided qrels file.</p>
<p>The results will be discussed in detail in Section 4.2.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Git</title>
        <p>To develop the project, our group made extensive use of Git in order to collaborate and organize the
files.</p>
        <p>The full source code is available in our Bitbucket repository: https://bitbucket.org/upd-dei-stud-prj/seupd2425-rand/src/master/.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Hardware</title>
        <p>The used hardware involved diferent systems to manage the heavy workload on the machines more
time-eficiently.</p>
        <p>These were the machines that were used:
Hardware specification of the four machines used in the experiments</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Runs description</title>
<p>Here we describe the configurations of our system and their results, with a summary of the tokenizer,
filters, stoplist, and stemmer used for the runs. All runs use the Whitespace tokenizer; the stoplists
employed include Oracle French and Azure (Igor Brigadir).</p>
        <p>Detailed summary of system configurations and approaches tested (run ID: filters):
• seupd2425-rand-nofilters: LowerCase, Length
• seupd2425-rand-queryLength: LowerCase, Length
• seupd2425-rand-frenchFilter: LowerCase, Length
• seupd2425-rand-DL: LowerCase, Length
• seupd2425-rand-synonyms: LowerCase, Length
• seupd2425-rand-elision+synonyms0.5: LowerCase, Length, Elision
• seupd2425-rand-ICU: LowerCase, Length, ICUFolding
• seupd2425-rand-englishFilter: LowerCase, Length, English Possessive</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results on the training set</title>
        <p>The table below presents the final averaged evaluation metrics for each system configuration (run).
We report three key metrics: nDCG, nDCG@10, and MAP. These metrics provide a comprehensive
view of the retrieval performance across different strategies applied during the experiments, including
language filtering, synonym expansion, and reranking.</p>
        <p>The reported values represent the average scores obtained over nine monthly runs. These results help
to highlight the relative strengths of each configuration and guide the selection of the most effective
retrieval strategy.</p>
        <p>As discussed previously, the ICU configuration was revealed to have the best performance in terms
of each of the three evaluation measures considered for the comparison.</p>
        <p>The FrenchFilter runs turned out to have the second-best nDCG, while the re-ranking mechanism
adopted in the DL configuration leads to higher nDCG@10 and MAP values, as expected.
The chart (Fig. 1) shows the nDCG trend over time, evaluating the effectiveness of different parameter
configurations in search runs. In particular, for re-ranking we retrieved a maximum of 100 documents
per query, because the execution time required to retrieve 1000 documents per query was too long; this is
also why its nDCG is particularly low.</p>
        <p>In particular, we can deduce the following trends from the graph:
• All configurations show a clear improvement starting around October 2022, probably due to the
higher quality of the collections (e.g., better structured, less noisy), making it easier to return
relevant documents.
• The most notable gains are seen with French-specific filters. The probable reason is that the
queries and documents are matched in the same language and tokenization style, reducing noise
greatly and improving ranking quality.
• Without filters, irrelevant terms or tokens might be included in both the indexing and querying
stages, causing mismatches and drops in performance compared to runs with filters applied.
• reRanking uses more sophisticated models (e.g., DL or LLM-based) to reorder the top results,
improving the placement of highly relevant documents; however, because we set the
parameter maxDocsRetrieved to 100, we observe a drop in metrics.
• synonyms+elision offers a moderate boost over noFilter, indicating that query expansion
helps but is not as impactful as filtering.</p>
        <p>The second graph (Fig. 2) compares the different configurations of our system using various synonym
weights for query expansion. Runs with weight = 0.5 achieve better results in terms of nDCG, while
increasing the weight leads to a drop in performance.</p>
        <p>The comparison also includes the runs with the frenchFilter and the elisionFilter, the latter using
query expansion with a synonym weight of 0.5. Across all months in the collection, query expansion
results in a degradation of system performance when compared to the frenchFilter configuration. The
introduction of the ElisionFilter does not cause significant changes in the nDCG values compared
to the weight = 0.5 configuration, but performs slightly better on average in terms of nDCG, nDCG@10,
and MAP.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results on the Test Collection</title>
        <p>This section outlines the results of the runs submitted to CLEF, presenting both raw performance scores
and accompanying statistical evaluations to allow for a clearer comparison of system behavior and
the impact of the implemented strategies. As in Section 4.2, we show the graph that represents the
evolution of the nDCG over the months of the test collection, calculated for our four submitted systems
(Fig. 3). This general comparison confirms that the ICU configuration is the best in terms of nDCG.
In order to follow the initial LongEval task, which consists in the development of an IR system capable
of managing data changes over time, the following analysis will focus on three different snapshots of
the test collection: short-term (2023-03), middle-term (2023-06), and long-term (2023-08). For each of
the three cases, we compare our submitted systems with boxplots, to visualize their mean and variance.
After that, we compute the two-way ANalysis Of VAriance (ANOVA), testing the null hypothesis
H0: μi = μj (where i and j denote two generic systems to be compared), and perform
Multiple Comparisons with Tukey’s HSD Test. All tests are performed with a significance level α = 0.05.</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Short-term analysis</title>
          <p>Just for the short-term analysis, we also show the results of the one-way ANOVA (Table 4). However,
we will focus only on the two-way ANOVA for the remaining tests. The reason for our choice is that
the two-way ANOVA considers both the topics and the systems’ variabilities, while the one-way
ANOVA only considers the system’s variability.</p>
          <p>Table 5 shows that both factors (systems and topics) are statistically significant (p &lt; 0.001), with
topics contributing substantially more to variance, indicating strong variability across queries. These
considerations lead to rejecting the null hypothesis.</p>
<p>In Figure 5 we notice that the ICU system (in blue) is statistically different from the elision+synonyms0.5
and DL systems (in red), while it is not significantly different from the frenchfilter system (in gray).
In Figure 4 we can see that all systems show relatively low medians, around 0.15–0.2. The IQRs are
quite similar for all systems, indicating comparable variability.</p>
          <p>Each system has positive outliers (near 1.0), meaning that a few instances achieved high effectiveness.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Middle-term analysis</title>
<p>In this second experimental setting, the results confirm that both the systems and the topics significantly
affect performance, with p-values &lt; 0.0001 for both factors (Fig. 7). In particular, the F-value for the
systems increased from 25.61 to 27.28, indicating a slightly stronger differentiation among the systems
in this setting. In contrast, the F-value for the topics decreased from 2.79 to 2.26, though still statistically
significant. The Sum of Squares for topics (1051.42) remains considerably higher than that of systems
(5.83), showing that topic variability still accounts for the largest portion of the total variance. These
findings support the conclusion that system performance differences are consistent and significant,
even when accounting for the large variance introduced by the topic dimension.</p>
          <p>F
p-value
In Figure 6 median values and overall IQRs are fairly consistent across months, and the number of
high-end outliers increased slightly, especially for elision+synonyms0.5, indicating more sporadic
high-performance cases in comparison with the previous boxplot.
mean_dif q_stat p_value</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>4.3.3. Long-term analysis</title>
<p>Table 9 confirms a statistically significant difference among the systems and topics (p-values &lt; 0.0001).
The Sum of Squares for topics is very high again (1456.32), which is consistent with the other tables (e.g.,
1051.42 in the middle-term analysis). However, because the number of topics (df = 10554) is larger, the
Mean Square is the smallest (0.1380) among all the different snapshots. The F-statistic for topics (2.1174)
is slightly lower than in the other monthly snapshots (e.g., 2.7914 in the short-term snapshot), suggesting a
significant effect of topic variability.
Finally, in the boxplot of Figure 8 we can also see that median values are lower than in previous months,
especially for DL, which shows a very low central tendency, close to 0.</p>
          <p>All four systems have a large number of outliers, particularly concentrated near the top (nDCG near
the ’1’ value). This suggests rare but highly successful retrievals.</p>
<p>The IQR is small, especially for DL, meaning most of its performance lies in a low range with little
variability, except for many strong outliers.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>This work presented a modular information retrieval system built on top of the Lucene framework,
enhanced with custom query formulation strategies and optional query expansion through WOLF,
optionally using deep learning or large language models for the re-ranking phase.</p>
      <p>The system demonstrated flexibility in handling TREC and JSON topics and produced standardized run
files suitable for evaluation.</p>
      <p>Initial experiments were based on configurations with few filters; a French-dedicated stoplist and
stemmer were then introduced to improve our system’s performance.</p>
      <p>The implementation of a Deep Learning model for reranking seemed promising in terms of
nDCG@10, but its high computational cost forced us to retrieve a maximum of 100 documents per
query, and this led to a drop in nDCG.</p>
      <p>Future developments will focus on integrating deep learning models and large language models (LLMs)
for more effective and efficient query rewriting and document re-ranking.</p>
      <p>These enhancements aim to boost semantic understanding and retrieval precision.
Further work will also explore improved query expansion with synonyms, incorporating relevance
feedback mechanisms, better management of short queries, and refined scoring models to better
capture user intent.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Writefull in order to check grammar
and spelling and to paraphrase and reword text. After using these tools/services, the authors reviewed and
edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] L. Bellin, A. A. Carè, M. Martini, M. T. Pepaj, M. Salvalaio, A. Segala, M. Tognon, N. Ferro, SEUPD@CLEF: Team GWCA on Longitudinal Evaluation of IR Systems by Using Query Expansion and Learning To Rank, in: Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, Vol. 3497, 2023. https://ceur-ws.org/Vol-3497/paper-186.pdf.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, arXiv preprint arXiv:1908.10084 (2019).</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>