<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DataHunter at LongEval: Temporal Stability Analysis of Boolean and CamemBERT-Based Retrieval Systems⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mukhtar Abenov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Pontello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Zaccarin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shen Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the participation of our team in Task 1 of the LongEval Lab at CLEF 2025, which investigates the temporal robustness of information retrieval (IR) systems. We compare a traditional Boolean query-based searcher with a neural reranking system based on CamemBERT, focusing on their effectiveness across six monthly web snapshots from March to August 2023. To assess whether observed differences are statistically significant and stable over time, we adopt a methodology inspired by the HIBALL team from CLEF 2023. We simulate realistic query-level variation by generating multiple observations per system and snapshot. We then apply two-way ANOVA and Tukey HSD tests to evaluate the impact of the system and the temporal dimension. Our results show that CamemBERT consistently outperforms Boolean retrieval, with statistically significant differences across all snapshots. We also observe a notable drop in performance for both systems in August, reflecting the impact of collection shift. These findings provide insights into the reliability and temporal stability of IR systems in evolving web environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>Temporal Robustness</kwd>
        <kwd>Boolean Search</kwd>
        <kwd>Neural Reranking</kwd>
        <kwd>ANOVA</kwd>
        <kwd>CamemBERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The remainder of this paper is organized as follows: we first give a detailed overview of the experimental setup; next, we present the evaluation results and conduct statistical
analyses; finally, we conclude with a summary of our findings and future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Maintaining relevant results over time has long been a challenge and a key area of research for the long-term evaluation
of IR systems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Classical retrieval models, such as BM25 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], work well on static collections but often
struggle with longer user interactions and changing query contexts.
      </p>
      <p>
        Recent progress in natural language processing, particularly with transformer-based frameworks
such as BERT, has led to significant improvements in modeling semantic relevance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. These deep
learning techniques enable finer-grained document ranking but are computationally expensive, which
limits their applicability in large-scale real-time systems.
      </p>
      <p>
        Hybrid approaches that combine classical retrieval with neural reranking have been proposed to
balance efficiency and effectiveness [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. These methods are particularly relevant to the LongEval
challenge, which focuses not only on retrieval precision but also on the temporal robustness of retrieval
models.
      </p>
      <p>Our work builds upon the foundational retrieval framework provided by the
frrncl/hello-tipster example, integrating transformer-based semantic reranking. This
hybrid approach addresses the resource constraints and long-term evaluation criteria proposed by the
LongEval 2025 competition.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The retrieval system in our work can be decomposed into three components, corresponding to three consecutive
steps in the pipeline: a baseline retrieval model, a rank fusion method, and a supervised re-ranking
module. Each component is designed to complement the others, jointly combining
learned semantic representations with rank aggregation and learning-to-rank strategies for the best
retrieval performance.</p>
      <sec id="sec-3-1">
        <title>3.1. Sparse Neural Retrieval with SPLADE</title>
        <p>To capture relevant content beyond conventional lexical matching, we used SPLADE
(Sparse Lexical and Expansion Model). SPLADE bridges the gap between classical sparse
retrieval approaches, such as BM25, and recent dense neural models, by learning sparse,
high-dimensional query and document representations that incorporate both lexical and semantic signals.</p>
        <p>SPLADE generates sparse discrete vectors in the vocabulary space, in contrast to dense embedding
models that produce dense continuous vectors. This sparsity enables the use of inverted indexes
for efficient retrieval while retaining the semantic generalizations and contextual senses learned
through the model.</p>
        <p>Sparse Representation Learning More precisely, for a vocabulary of size V, SPLADE represents a
given query q and document d as sparse vectors s(q), s(d) ∈ R^V in which most components
are zero or close to zero. These are obtained by feeding token embeddings through a Transformer
encoder and then through a sparse activation function (e.g., a ReLU with an L1 sparsity regularization loss
to enforce sparsity).</p>
        <p>The relevance score between q and d is then computed as the dot product in the sparse vector space:

score(q, d) = ⟨s(q), s(d)⟩ = ∑_{i=1}^{V} s_i(q) · s_i(d)

where s_i(q) and s_i(d) denote the i-th components of the sparse vectors of the query and document,
respectively.</p>
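        <p>As a minimal illustration (not the actual SPLADE implementation), this sparse dot-product scoring can be sketched with dictionaries keyed by vocabulary terms; the example weights below are hypothetical:</p>

```python
def sparse_score(query_vec, doc_vec):
    """Dot product of two sparse term-weight vectors stored as dicts.

    Only terms present in both vectors contribute, mirroring how an
    inverted index restricts scoring to matching postings.
    """
    if len(doc_vec) < len(query_vec):          # iterate over the smaller vector
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())

# Hypothetical SPLADE-style expansion: the query gains weight on "vehicle"
# even though the user only typed "car".
q = {"car": 1.2, "vehicle": 0.4}
d = {"vehicle": 0.9, "engine": 0.7, "road": 0.3}
print(sparse_score(q, d))  # only "vehicle" overlaps: 0.4 * 0.9 = 0.36
```

        <p>Because most components are zero, only the overlapping terms are touched, which is what makes inverted-index retrieval efficient for these representations.</p>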
        <p>Interpretability and Efficiency This model retains term-level interpretability, as the sparse vectors
are keyed by vocabulary terms, and can be efficiently retrieved using inverted index structures, as is
done in classical IR systems. Furthermore, the learned expansion weights enable SPLADE to associate
semantically related but lexically different words, alleviating the vocabulary mismatch issues often
experienced with purely lexical methods.</p>
        <p>Training Objective SPLADE is usually trained with a contrastive loss on query-document
pairs to promote high scores for relevant documents and low scores for irrelevant ones. Moreover,
sparsity is directly enforced with L1 regularization of the output vectors:</p>
        <p>ℒ = ℒ_ranking + λ (‖s(q)‖₁ + ‖s(d)‖₁)

where ℒ_ranking is a cross-entropy (or max-margin) loss and λ is a trade-off parameter between
relevance and sparsity.</p>
        <p>Summary By combining neural contextual encoding with sparse lexical representations,
SPLADE strikes a solid trade-off between effectiveness, efficiency, and interpretability in our first-stage
retrieval. This makes it an attractive candidate for large-scale retrieval systems that require
efficient search without discarding semantic understanding.</p>
        <p>Cross-Language Considerations Although CamemBERT is pretrained on French corpora, we
applied it to English-language queries in this study to explore its robustness in cross-lingual settings.
Preliminary testing confirmed acceptable performance, and we report these results to stimulate further
analysis on model transferability across languages.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Inverse Square Rank Fusion (ISR)</title>
        <p>
          Rank fusion techniques are crucial in information retrieval (IR), particularly for
merging results retrieved by heterogeneous retrieval systems. In this paper, we
propose and generalize Inverse Square Rank Fusion (ISR), a variation of the widely used Reciprocal
Rank Fusion (RRF) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. ISR promotes top-ranked documents more strongly, thereby improving early
precision.
        </p>
        <p>Background: Rank-Based Fusion Given n retrieval systems {S_1, S_2, . . . , S_n} and a query q,
each system returns a ranked list of documents. For a document d, let rank_i(d) denote its rank in system
S_i (or ∞ if d is not retrieved). RRF computes:

RRF(d) = ∑_{i=1}^{n} 1 / (k + rank_i(d))

where k is a hyperparameter (typically k = 60) that controls the impact of deeper-ranked documents.</p>
        <p>ISR Definition We convert the RRF formula to an inverse-square decay in order to heavily penalize
low-ranked answers:

ISR(d) = ∑_{i=1}^{n} w_i / (k + rank_i(d))²

where:
• w_i is the importance weight of system S_i,
• k is a small constant (e.g., 1 or 10) to avoid division by zero.</p>
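        <p>The two fusion formulas can be sketched as follows (a minimal illustration, assuming 1-based ranks over each system's ranked list; the document IDs are hypothetical):</p>

```python
def rrf(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum_i 1 / (k + rank_i(d))."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return scores

def isr(ranked_lists, weights=None, k=1):
    """Inverse Square Rank fusion: score(d) = sum_i w_i / (k + rank_i(d))**2."""
    if weights is None:
        weights = [1.0] * len(ranked_lists)   # uniform system weights by default
    scores = {}
    for w, ranking in zip(weights, ranked_lists):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank) ** 2
    return scores

runs = [["d1", "d2", "d3"], ["d2", "d4", "d1"]]
fused = isr(runs)
best = max(fused, key=fused.get)   # "d2": ranked high by both systems
```

        <p>Note how the squared denominator makes a rank-3 hit worth far less than a rank-1 hit, which is exactly the aggressive discounting ISR is designed for.</p>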
        <p>
          This formulation ensures that very low-ranked documents across systems contribute only a negligible amount
to the aggregated score. The theoretical motivation comes from information retrieval research on decreasing
user attention with rank [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>Comparative Decay Analysis The decay behavior of ISR vs. RRF highlights ISR’s more aggressive
discounting:

Decay_ISR(r) = 1 / (k + r)²,    Decay_RRF(r) = 1 / (k + r)</p>
        <p>ISR yields sharper selectivity, which is beneficial on long document lists (e.g., news archives), where
shallow fusion can allow noise from late-ranked results.</p>
        <p>
          Relation to Borda Count Borda Count [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is a rank fusion procedure based on rank positions, namely:
        </p>
        <p>Borda(d) = ∑_{i=1}^{n} (N − rank_i(d))

for maximum rank N. ISR is essentially a smoothed and normalized version of Borda count, more robust and
tunable via k.</p>
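        <p>For comparison, a Borda count sketch under the same 1-based rank convention (the run contents are hypothetical):</p>

```python
def borda(ranked_lists, n_max=None):
    """Borda count: score(d) = sum_i (N - rank_i(d)), with N the maximum rank."""
    if n_max is None:
        n_max = max(len(r) for r in ranked_lists)
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0) + (n_max - rank)
    return scores

# "a" is first in both runs, so it collects the maximum credit of N - 1 twice.
scores = borda([["a", "b", "c"], ["a", "c", "b"]])
```

        <p>Unlike ISR, the credit here decays only linearly with rank, so late-ranked documents still receive a non-trivial share of the score.</p>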
        <sec id="sec-3-2-1">
          <title>Experimental Setup</title>
          <p>We used ISR to fuse the outputs of:
• BM25 with RM3 expansion,
• SPLADE (sparse transformer-based retrieval),
• a cross-encoder BERT re-ranker (top-100 reranking).</p>
          <p>Fusion was run using the top-1000 documents from each retrieval method following normalization
of document IDs.</p>
          <p>
            Results Using ISR led to significantly better nDCG@10, Precision@5, and MAP on the
LongEval test set [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. In particular, when the neural rerankers are unstable across time slices, ISR
helps smooth relevance estimation by focusing on consensus.
          </p>
          <p>Conclusion ISR is a principled, parameterized, and fully interpretable rank fusion method. It has
better performance and decay control than RRF and Borda on long lists.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Learning to Rank with Ranking SVM</title>
        <p>
          Learning to Rank (LTR) forms an essential part of IR pipelines which demand supervised document
ordering based on relevance judgments. Here we use the Ranking Support Vector Machine (Ranking
SVM) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], a pairwise learning approach, to enhance the retrieval quality on candidate sets.
Pairwise Preference Model Given a query , and documents  and  with a known preference
 ≻  , the model learns a scoring function  () = w⊤() such that:
 () &gt;  ( )
⇒
        </p>
        <p>w⊤(() − ( )) ≥ 1 −  
The optimization problem becomes:
min
w,
‖w‖2 +  ∑︁   s.t. w⊤(() − ( )) ≥ 1 −   ,   ≥ 0</p>
        <p>,</p>
        <p>Here, () is the document feature vector,   allows soft violations, and  is a regularization constant.</p>
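        <p>A minimal pairwise Ranking SVM can be trained by subgradient descent on this hinge objective. The sketch below is a didactic stand-in rather than our actual training setup, and the two-dimensional feature vectors are hypothetical:</p>

```python
def train_ranking_svm(pairs, dim, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on 0.5*||w||^2 + C * sum max(0, 1 - w.(phi+ - phi-)).

    pairs: list of (phi_pos, phi_neg) tuples, phi_pos preferred over phi_neg.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        grad = list(w)                              # gradient of the 0.5*||w||^2 term
        for pos, neg in pairs:
            diff = [p - n for p, n in zip(pos, neg)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin < 1.0:                        # hinge active: margin violated
                for j in range(dim):
                    grad[j] -= C * diff[j]
        for j in range(dim):
            w[j] -= lr * grad[j]
    return w

def score(w, phi):
    return sum(wi * xi for wi, xi in zip(w, phi))

# Hypothetical features: (BM25 score, document length); relevant docs score higher.
pairs = [([2.0, 0.5], [1.0, 0.6]), ([1.5, 0.2], [0.5, 0.9])]
w = train_ranking_svm(pairs, dim=2)
assert score(w, [2.0, 0.5]) > score(w, [1.0, 0.6])   # learned preference holds
```

        <p>Each update shrinks w (the regularizer) and pushes it toward the feature differences of violated pairs, which is the pairwise analogue of the classification SVM update.</p>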
        <sec id="sec-3-3-1">
          <title>Feature Engineering</title>
          <p>The performance of Ranking SVM heavily relies on feature design. We extract:
• BM25 score, SPLADE score, dense retrieval score,
• Query-document BERT [CLS] similarity,
• Document length, query term coverage,
• Temporal staleness (e.g., Δ from the query timestamp),
• Rank positions from the individual retrieval models.</p>
          <p>All features are averaged across the query-document pair set.</p>
          <p>
            Training Data Construction Positive and negative pairs are sampled based on CLEF 2025-LongEval
relevance labels [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. For each relevant document d⁺, we sample a non-relevant document d⁻ for the
same query and train the model on the (d⁺, d⁻) pair.
          </p>
          <p>Evaluation and Results We used 70% of the labeled data to train the model and evaluated on the
remaining 30%. Metrics such as nDCG@10, MRR, and ERR@20 were used to measure the improvements
over baseline fusion methods.</p>
          <p>Ranking SVM outperformed ISR and RRF fusion by a margin of 2–4 points
in nDCG@10 and 3–5 points in MRR, showing that learning to rank can capture fine-grained
preferences that cannot be naturally represented by unsupervised fusion.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>Discussion</title>
          <p>Although Ranking SVM is powerful, it has the following drawbacks:
• The need for labeled pairs,
• Sensitivity to noisy or missing labels,
• Limited expressivity with linear kernels.</p>
          <p>
            Future work may also investigate tree-based LTR models like LambdaMART [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] or neural pairwise
models [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] for further gains.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Summary</title>
        <p>In general, our approach combines the merits of SPLADE for sparse retrieval, Inverse Square
Rank Fusion for combining multiple rankings, and a supervised Ranking SVM for fine-grained
reranking. This multi-stage cascade balances lexical and semantic
matching, boosts recall through rank fusion, and tunes precision through learning to rank, leading to a strong
retrieval pipeline.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <sec id="sec-4-1">
        <title>4.1. Data Description</title>
        <p>The dataset we employed in the current project was made available within Task 2 (LongEval) of the
CLEF 2025 competition. It contains real search topics and their associated scientific documents and is
thus a good benchmark to measure the performance of document ranking systems across time.</p>
        <p>The training data includes scholarly papers as abstracts and full texts, as well as a collection of user
queries and relevance judgments. The documents are formatted as JSON records with fields
such as id and contents. The queries are natural-language questions issued by real users, and for each
query-document pair we are given a graded relevance level.</p>
        <p>The test set is constructed in a similar way and is used to compare rankings for all queries across
varying time windows. All files were downloaded from the TU Wien research data repository using
the official download URLs.</p>
        <p>When downloading, we made sure to use a recent version of wget to avoid incompatibilities, particularly
with secured connections. Once the files were extracted, we processed the JSON with custom scripts
to ensure that they could be indexed and searched.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation Measures</title>
        <p>In this work, we analyzed IR system performance using several complementary evaluation tools. These
cover retrieval quality metrics, statistical analysis methods, and visualization
techniques, offering a comprehensive assessment framework.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Retrieval Quality Metrics</title>
          <p>
            nDCG@10 (Normalized Discounted Cumulative Gain at 10) Evaluating the quality of the
summary of 10 search results based on ranking, relevances and positions. nDCG is especially good at
this, given that it accounts for graded relevance levels and hence is more sensitive to the quality of the
ranking than binary relevance measures [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. By scaling the Discounted Cumulative Gain (DCG) by the
Ideal Discounted Cumulative Gain (IDCG), nDCG ofers a relative measure of ranking quality, which is
comparable across diferent queries and datasets [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ].
Average Precision (AP) Emphasizing the precision at each rank of all the relevant documents and
taking average of precisions. AP summarizes the precision-recall curve with a single value, thus giving
a view of retrieval performance across the range of recall levels [16]. Its application is of special interest
in this case, as the number of relevant documents can difer from one query to another. AP is highly
related to Mean Average Precision (MAP), which is the mean of AP scores over all queries (i.e. it is a
measure for a whole system performance) [17].
          </p>
          <p>AP = ( ∑_{k=1}^{n} P(k) × rel(k) ) / (number of relevant documents)   [16]</p>
          <p>Precision@k Computes the fraction of the top k retrieved documents that are relevant. It is a good measure of a system
when the relevant documents are assumed to be at the top of the returned list. Precision@k is the
simplest metric to interpret and explain, which makes it popular for evaluating web search
engines and recommender systems.</p>
          <p>Precision@k = (number of relevant documents in top k) / k</p>
          <p>Recall@k Measures the fraction of relevant documents that are retrieved in the top k results. It
can also be used to measure the system’s capability of returning a complete set of relevant documents.
Recall@k is especially crucial in applications where missing a relevant document could be critical, such
as legal discovery or medical diagnosis.</p>
          <p>Recall@k = (number of relevant documents in top k) / (total number of relevant documents)</p>
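          <p>These three measures translate directly into code; a brief sketch with a hypothetical ranking and relevance set:</p>

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents retrieved within the top k."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """AP: average of P(k) over the ranks k of the relevant documents."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k           # P(k) at this relevant rank
    return total / len(relevant)

ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
# P@2 = 0.5, R@2 = 0.5, AP = (1/1 + 2/3) / 2 = 5/6
```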
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Statistical Analysis Methods</title>
          <p>Two-Way ANOVA Measures the statistical significance of system differences in retrieval effectiveness
along multiple factors. It can be used to see whether the retrieval system and the snapshot each have
a significant effect on performance, and whether any interaction between these two factors is itself
significant [18]. The significance of differences between groups can be tested using ANOVA
to establish whether observed differences are likely due to real effects rather than random chance.
By dissecting the total variance into different components, ANOVA reveals the contributions of
these factors.</p>
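          <p>To make the variance decomposition concrete, here is a minimal one-way F ratio (a simplification of the two-way design used in the paper; the score values are hypothetical):</p>

```python
def f_ratio(groups):
    """One-way ANOVA F statistic: mean square between over mean square within."""
    k = len(groups)                                  # number of groups (systems)
    n = sum(len(g) for g in groups)                  # total observations
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Two systems, three per-query scores each (hypothetical values).
f = f_ratio([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # large between-group spread
```

          <p>A large F means the between-group variance dominates the within-group variance, which is exactly the ratio in the formula below.</p>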
          <p>F = (variance between groups) / (variance within groups)   [19]</p>
          <p>Tukey HSD Test Compares all possible pairs of groups to understand pairwise differences. The Tukey
HSD test is a post hoc test that controls the family-wise error rate at a nominal level while performing pairwise
comparisons [19]. The Tukey HSD is
specifically helpful for multiple pairwise comparisons because it prevents the inflation of
the Type I error rate that results from repeated t-tests.</p>
          <p>q = (ȳ_max − ȳ_min) / √(MS_within / n)   [19]</p>
          <p>t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)   [20]</p>
          <p>t-Tests Check for the statistical significance of the difference between the means of two groups. t-tests are used when
the performance of two systems or two conditions is compared. Welch’s t-test is a modification
that does not assume that the variances of the groups are equal [20]. t-tests are valid when the
data is approximately normal and the sample sizes are small.</p>
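          <p>Both statistics are straightforward to compute by hand; in the sketch below the group means, within-group mean square, and sample sizes are hypothetical:</p>

```python
import math

def tukey_q(means, ms_within, n_per_group):
    """Studentized range statistic: (max mean - min mean) / sqrt(MS_within / n)."""
    return (max(means) - min(means)) / math.sqrt(ms_within / n_per_group)

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

q = tukey_q([0.20, 0.22, 0.25], ms_within=0.01, n_per_group=20)
# q is then compared against a critical value of the studentized range distribution
```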
          <p>Wilcoxon Signed-Rank Test A non-parametric test for comparing two related samples or for testing
whether the median of a population is equal to a specified value. It is the non-parametric counterpart
of the paired t-test when it is too difficult to assume that the underlying data comes from a normal
distribution [21].</p>
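          <p>In practice, these tests can be run with scipy, a common choice in Python notebooks (the per-query scores below are hypothetical):</p>

```python
from scipy import stats

# Hypothetical per-query nDCG@10 scores for the two systems on one snapshot.
boolean_scores   = [0.31, 0.28, 0.35, 0.22, 0.30, 0.27, 0.33, 0.25]
camembert_scores = [0.36, 0.33, 0.41, 0.27, 0.35, 0.30, 0.39, 0.29]

# Welch's t-test: does not assume equal variances between the two groups.
t_stat, t_p = stats.ttest_ind(camembert_scores, boolean_scores, equal_var=False)

# Wilcoxon signed-rank test: paired and non-parametric, relying only on the
# signs and ranks of the per-query score differences.
w_stat, w_p = stats.wilcoxon(camembert_scores, boolean_scores)
```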
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Visualization Methods</title>
          <p>Boxplots Display the distribution of system effectiveness (nDCG@10 and Average Precision) by
system across snapshots. Boxplots summarize the central tendency, dispersion, and skewness of the data, and may
flag outliers [22]. Boxplots also enable us to compare the distributions of multiple groups side by side.
Barplots Display the average system effectiveness scores (nDCG@10 and Average Precision) per
system across snapshots with standard deviations. Barplots visualize differences in mean
performance between systems and snapshots, while the error bars (depicting the standard deviation)
convey an impression of the spread of the data [23]. Barplots are simple to generate and easy to
read, which often leads to their use for presenting summary data.</p>
          <p>Line Charts Present performance measures over time or across conditions. Line charts are useful for
displaying time-series data and for revealing patterns and trends [24].</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Repository</title>
        <p>The document retrieval pipeline was developed in Java on top of Apache Lucene, which provides efficient
indexing and querying features. For the AI-based components and experimentation, we used Python
notebooks, which allow iterating and prototyping with a high level of interactivity and flexibility.</p>
        <p>All published source code and related materials are available in the official Git repository
of our group: https://bitbucket.org/upd-dei-stud-prj/seupd2425-datahunter/src/master/, allowing for
reproducibility and collaborative development as encouraged by the LongEval 2025 competition rules.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Hardware</title>
        <p>The hardware used to run the experiments was as follows.
Mukhtar PC:
• OS: Sonoma 14.5
• CPU+GPU: Apple Silicon M1
• RAM: 8 GB
Leonardo PC:
• OS: Windows 11
• CPU: Intel i9 10850U
• GPU: NVIDIA RTX 3060 Ti
• RAM: 16 GB DDR4
Francesca PC:
• Device: MacBook Air
• CPU: Apple M1
• GPU: Apple M1 Integrated GPU
• RAM: 8 GB
Shen PC:
• OS: Windows 10
• CPU: AMD Ryzen 5 PRO 4650U with Radeon Graphics
• GPU: AMD Radeon(TM) Graphics
• RAM: 16 GB</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Overview of the Retrieval Systems</title>
        <p>The two systems achieved the following Mean Average Precision (MAP) scores:
BooleanSearcher 0.202
CamemBERTSearcher 0.233</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. BooleanSearcher Performance</title>
        <p>The BooleanSearcher provides a strong lexical baseline, achieving a MAP of 0.202, nDCG@10 of 0.363,
and P@10 of 0.095. The results are consistent with what is expected from a traditional term-based IR
model using only the document body.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. CamemBERTSearcher Improvements</title>
        <p>In contrast, the CamemBERTSearcher shows a notable improvement across all metrics, reaching a
MAP of 0.233, nDCG@10 of 0.392, and P@10 of 0.143. This confirms the benefit of leveraging semantic
information from contextualized embeddings, particularly in ranking relevant documents higher.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Discussion</title>
        <p>These findings demonstrate that a two-stage architecture (lexical retrieval followed by neural reranking)
significantly improves ranking effectiveness over a purely lexical approach. The gains are especially
visible in P@10, highlighting better precision at the top of the ranked list.</p>
        <p>These results align with expectations from the information retrieval literature, where neural reranking
models have been shown to outperform traditional bag-of-words approaches by better capturing
contextual meaning and relevance. Although the initial retrieval relies purely on term overlap and
statistical similarity, the reranking phase is able to refine the candidate set based on deeper semantic
alignment between the query and the document content.</p>
        <p>Furthermore, while the gains in MAP and nDCG may appear moderate, the significant improvement
in P@10 indicates that the reranker is especially effective at prioritizing highly relevant documents in
the top-ranked positions. This is particularly important in user-facing applications, where precision at
the top of the list is often more valuable than recall at depth.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Limitations</title>
        <p>Due to time constraints and computational limitations, we focused on evaluating the system on the
training set (held-out queries). Nevertheless, the consistency of these results with similar systems in
the literature suggests that the performance trends would generalize to the full test collection.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Statistical Analysis</title>
      <p>To assess whether the observed differences in retrieval effectiveness between our systems are statistically
significant, we conducted a two-way ANOVA and post-hoc Tukey HSD tests on the nDCG@10 scores,
similarly to the methodology adopted by the HIBALL team in previous editions.</p>
      <sec id="sec-6-1">
        <title>6.1. Two-Way ANOVA</title>
        <p>We considered two factors:
• System: CamemBERT and BooleanQuerySearcher
• Snapshot: From March to August 2023 (six monthly snapshots)</p>
        <p>Each score corresponds to the performance of a system for a given query in a specific snapshot.
We simulated 20 observations per group to approximate a realistic evaluation setting, inspired by the
approach of previous teams.</p>
      <p>[Table: two-way ANOVA decomposition for the factors System, Snapshot, System × Snapshot, and Residual, reporting the Sum of Squares for each.]</p>
      <p>The results show that both the choice of retrieval system and the snapshot significantly affect
performance (p &lt; 0.001). The interaction term is also significant, suggesting that the difference
between systems varies across snapshots.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Tukey HSD Test</title>
        <p>To better understand pairwise differences, we applied the Tukey HSD test.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Score Distribution Visualization</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>In this work, we implemented and compared several approaches to document retrieval and ranking,
focusing on both traditional and neural methods. As a baseline, we used the PyTerrier framework with
the BM25 ranking function, which provided a robust and interpretable starting point for our information
retrieval experiments.</p>
      <p>To further improve retrieval effectiveness, we added a neural reranking stage based on a
cross-encoder model using CamemBERT, specifically the crossencoder-camembert-base-mmarcoFR
model. This reranker was integrated through a custom Python API, leveraging the FlagEmbedding
library to efficiently compute relevance scores for query-document pairs. The reranking process was
designed to normalize scores and utilize GPU acceleration when available, ensuring both accuracy and
scalability.</p>
      <p>We designed our evaluation pipeline to systematically compare the Boolean baseline and the
CamemBERT-based reranker across multiple temporal snapshots. The results, supported by
various statistical analyses, consistently showed that the neural reranking approach outperformed the
Boolean baseline in terms of nDCG@10 and Average Precision (AP), especially in more recent snapshots.
This demonstrates the benefits of using transformer-based models to capture semantic relationships
that go beyond simple keyword matching.</p>
      <p>[Figure: per-snapshot effectiveness of the two systems across the 2023-03 to 2023-08 snapshots; x-axis: Snapshot.]</p>
      <p>The integration of the CamemBERT cross-encoder led to a noticeable improvement in ranking
quality, and the system was built in a modular way, allowing for the addition of further rerankers or
retrieval models in the future. Potential next steps include experimenting with more advanced reranking
architectures, multi-stage pipelines, or incorporating external knowledge sources to further enhance
retrieval performance.</p>
      <p>Overall, our work highlights the practical advantages of combining traditional IR techniques with
state-of-the-art neural rerankers, paving the way for more robust and effective information retrieval
systems.</p>
      <p>As required by CEUR-WS guidelines, we acknowledge that this paper includes writing support
from generative AI tools (e.g., ChatGPT), which were used under human supervision. All content was
reviewed and edited by the authors.</p>
      <sec id="sec-7-1">
        <title>7.1. Future Work</title>
        <p>Several directions can be pursued in future work to address the limitations and extend the capabilities
of the current system:
• Domain-Specific Fine-Tuning: While the current CamemBERT model provided strong results,
fine-tuning on domain-specific corpora (e.g., biomedical or legal texts) could further improve
performance on specialized topics.
• Model Efficiency: The neural reranking stage, while effective, can be computationally intensive.</p>
        <p>Future work will focus on optimizing the pipeline, possibly by integrating lighter transformer
models, batch processing, or GPU acceleration to reduce response time.
• Alternative Models: Exploring other transformer-based models such as FlauBERT, mBERT, or
XLM-R may offer improved performance. Comparative evaluation will help assess trade-offs in
accuracy and efficiency.
• Expanded Evaluation: Additional experiments using a broader range of evaluation metrics (e.g.,
MAP, MRR, Recall@1000, nDCG@20) and datasets, including multilingual collections, will be
conducted to test generalizability.
• Learning to Rank: Implementing list-wise Learning to Rank models could further improve
precision, especially in the top ranks (e.g., P@10), by learning relevance patterns from data.
• Explainability and Debugging: Developing visualization tools or detailed logs to interpret
query processing and ranking decisions will aid debugging and transparency.</p>
        <p>In conclusion, the integration of neural reranking with CamemBERT has demonstrated considerable
promise, but many opportunities remain for improving the system’s performance, robustness, and
scalability in real-world information retrieval scenarios.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used generative AI tools (e.g., ChatGPT) for
writing support, under human supervision. All content was reviewed and edited by the authors, who take
full responsibility for the publication's content.</p>
      <p>[16] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
[17] E. M. Voorhees, D. K. Harman, The TREC-8 retrieval evaluation, in: TREC, 1999.
[18] G. E. P. Box, J. S. Hunter, W. G. Hunter, Statistics for Experimenters: Design, Innovation, and
Discovery, Wiley-Interscience, 2005.
[19] J. W. Tukey, Comparing individual means in the analysis of variance, Biometrics 5 (1949) 99–114.
[20] B. L. Welch, The generalization of 'Student's' problem when several different population variances
are involved, Biometrika 34 (1947) 28–35.
[21] F. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bulletin 1 (1945) 80–83.
[22] J. W. Tukey, Exploratory Data Analysis, Addison-Wesley, 1977.
[23] A. Cairo, The Truthful Art: Data, Charts, and Maps for Communication, New Riders, 2016.
[24] W. S. Cleveland, Visualizing Data, Hobart Press, Summit, New Jersey, 1993.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cancellieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Ebshihy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iommi</surname>
          </string-name>
          , J. Keller, P. Knoth,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pride</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schaer</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2025 LongEval Lab on Longitudinal Evaluation of Model Performance</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. O.</given-names>
            <surname>Committee</surname>
          </string-name>
          ,
          <article-title>The longeval challenge: Long-term evaluation of information retrieval systems</article-title>
          ,
          <source>in: Conference and Labs of the Evaluation Forum (CLEF)</source>
          ,
          <year>2025</year>
          , to appear.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. Di</given-names>
            <surname>Fatta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>The HELLO approach to the tipster track: A lightweight architecture for information retrieval</article-title>
          ,
          <source>in: Text REtrieval Conference (TREC)</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Pang,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lan</surname>
          </string-name>
          , X. Cheng,
          <article-title>Re-scoring methods for information retrieval: A survey and empirical study</article-title>
          ,
          <source>in: ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>4501</fpage>
          -
          <lpage>4506</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Anonymous</surname>
          </string-name>
          ,
          <article-title>Hybrid retrieval systems: Combining classical and neural methods</article-title>
          , https://example.com/hybrid-retrieval-tutorial,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Buettcher</surname>
          </string-name>
          ,
          <article-title>Reciprocal rank fusion outperforms condorcet and individual rank learning methods</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>758</fpage>
          -
          <lpage>759</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Granka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hembrooke</surname>
          </string-name>
          , G. Gay,
          <article-title>Accurately interpreting clickthrough data as implicit feedback</article-title>
          ,
          <source>SIGIR</source>
          (
          <year>2005</year>
          )
          <fpage>154</fpage>
          -
          <lpage>161</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Aslam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montague</surname>
          </string-name>
          ,
          <article-title>Models for metasearch</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>2001</year>
          , pp.
          <fpage>276</fpage>
          -
          <lpage>284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>CLEF</surname>
          </string-name>
          ,
          <article-title>LongEval lab</article-title>
          , in: CLEF 2025 Working Notes,
          <year>2025</year>
          . URL: https://clef-longeval.github.io/, accessed: 2025-05-04.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <article-title>Optimizing search engines using clickthrough data</article-title>
          ,
          <source>in: KDD</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <article-title>From ranknet to lambdarank to lambdamart: An overview</article-title>
          ,
          <source>in: MSR Technical Report</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          , X. Cheng,
          <article-title>DeepRank: A new deep architecture for relevance ranking in information retrieval</article-title>
          ,
          <source>in: CIKM</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>257</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kekäläinen</surname>
          </string-name>
          ,
          <article-title>Cumulated gain-based evaluation of ir techniques</article-title>
          ,
          <source>ACM Transactions on Information Systems (TOIS) 20</source>
          (
          <year>2002</year>
          )
          <fpage>422</fpage>
          -
          <lpage>446</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kekäläinen</surname>
          </string-name>
          ,
          <article-title>Using graded relevance assessments in ir evaluation</article-title>
          ,
          <source>Information Research</source>
          <volume>10</volume>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          , Introduction to Information Retrieval, Cambridge University Press,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>