<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SEUPD@CLEF: Team BASETTE at LongEval: IR System for Basic Hardware</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alberto Bottari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Croce</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fatemeh Mahvari Habib Abadi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padua</institution>
          ,
          <addr-line>Via Gradenigo 6/B, 35131 Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This report describes the system developed by Team BASETTE for the CLEF 2025 LongEval Lab, Task 1 Web Retrieval. Our main priority was optimizing performance on limited and commonly available hardware, deliberately avoiding the use of GPUs or other specialized computational resources. The system relies on classical Information Retrieval techniques and is designed to run both indexing and retrieval in a multithreaded fashion to ensure high execution speed. During development, we explored various strategies, some of which were discarded not only due to limited effectiveness, but also because their processing time was not compatible with our efficiency constraints.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>CLEF 2025</kwd>
        <kwd>LongEval</kwd>
        <kwd>Web Search</kwd>
        <kwd>Resource-Constrained Systems</kwd>
        <kwd>Classical IR Techniques</kwd>
        <kwd>Multithreaded Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>In the following subsections, we provide a detailed breakdown of our methodology. We begin by
describing the architecture of a classical IR system, detailing each of its main modules: parser, preprocessor,
analyzer, indexer, and searcher. Next, we introduce our multithreading strategy, which maximizes
performance under constrained hardware conditions. We then discuss how we leveraged automated
hyperparameter optimization using Optuna to fine-tune our system. Subsequently, we highlight several
approaches that were explored but ultimately discarded, either due to insufficient effectiveness or
excessive computational cost. Finally, we explain how the system’s configuration works.</p>
      <p>At the core of the system is the InformationRetrievalSystem class, which integrates all major
components of the pipeline, from preprocessing and parsing to indexing and searching, coordinating
their interaction throughout the retrieval process. The system is launched via the Main class, which
handles command-line argument parsing and initializes the configuration environment.</p>
      <sec id="sec-2-1">
        <title>2.1. Document Parsing</title>
        <p>The document parsing process involves specialized parsers such as the DirectoryDocumentParser
and JsonFileDocumentParser. The DirectoryDocumentParser recursively scans directories,
processing each file as an individual document and filtering out unwanted files using user-defined
patterns. The JsonFileDocumentParser handles JSON files containing arrays of documents,
converting them into a structured format suitable for indexing. The parsed documents are converted into
ParsedDocument objects, containing content and metadata for indexing. This approach allows for
extendible integration of new parsers as needed, supporting various document formats and sources.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Query Parsing</title>
        <p>Query parsing is the stage where raw query inputs are read, interpreted, and transformed into structured
query objects suitable for execution by the search engine. This functionality is abstracted by the
IQueryParser interface and implemented in one or more concrete classes. Each query is converted
into a Lucene Query object and wrapped with metadata for tracking and scoring. This design allows
the parser to be used in iterator-style loops and ensures that both the structured query and its metadata
(e.g., query ID and original text) are available to downstream components. The primary implementation
of IQueryParser in the system is TxtQueryParser. It reads a plain text file where each line contains
one query, usually in the format:</p>
        <p>&lt;query-id&gt;\t&lt;query-text&gt;</p>
        <p>The RawQuery class encapsulates metadata associated with a query. It typically includes the qid,
which is the unique identifier of the query, and the text, which is the raw textual form of the query as
provided in the input file. The query text is then tokenized and processed using the configured analyzer,
and a corresponding Lucene Query object is generated. The parsed query text may be used to construct
various types of Lucene queries, including term queries, phrase queries, and boolean queries. The final
Lucene Query object is passed to the searcher.</p>
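        <p>As an illustration, a line in the format above can be split into a (qid, text) pair as follows; this is a minimal stdlib sketch with illustrative names, not the system’s TxtQueryParser itself:</p>

```java
// Minimal sketch of parsing one tab-separated query line into (qid, text).
// The RawQuery record is an illustrative stand-in for the system's class.
public class QueryLineParser {
    public record RawQuery(String qid, String text) {}

    public static RawQuery parseLine(String line) {
        // Split on the first tab only: the query text itself may contain tabs.
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Malformed query line: " + line);
        }
        return new RawQuery(line.substring(0, tab), line.substring(tab + 1).trim());
    }
}
```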
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Text Preprocessing</title>
        <p>The system provides a mechanism to apply a chain of transformations to input queries or documents
before tokenization. This is achieved using the IPreProcessor interface, which defines a single
process(String query) method. Two notable implementations include the UnicodeNormalizer
PreProcessor, which applies Unicode normalization, and the RegexPreProcessor, which applies
regular expression substitutions using a predefined pattern and replacement string.</p>
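          <p>A minimal sketch of how such a chain could be composed and applied, assuming an IPreProcessor with the single process method described above; the composition helper and the concrete steps are illustrative:</p>

```java
import java.text.Normalizer;
import java.util.List;

// Sketch of chaining preprocessors before tokenization.
public class PreProcessorChain {
    public interface IPreProcessor { String process(String query); }

    // Unicode normalization (NFKC folds compatibility forms, e.g. the ligature "ﬁ" -> "fi").
    public static final IPreProcessor UNICODE =
        s -> Normalizer.normalize(s, Normalizer.Form.NFKC);

    // Regex substitution with a predefined pattern and replacement string.
    public static IPreProcessor regex(String pattern, String replacement) {
        return s -> s.replaceAll(pattern, replacement);
    }

    public static String applyAll(String input, List<IPreProcessor> chain) {
        for (IPreProcessor p : chain) input = p.process(input);
        return input;
    }
}
```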
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Text Analysis</title>
        <p>The text analysis phase transforms raw textual input into a stream of tokens that can be indexed and
later retrieved. The analysis is performed through the GenericAnalyzer class, the main analyzer
component, which is dynamically assembled using configuration files specified via implementations of
the IAnalyzerConfig interface. GenericAnalyzer follows a pipeline composed of:
• Tokenizer: Responsible for splitting the input text into raw tokens.
• Token Filters: Sequentially modify, normalize, or enrich the token stream.
• Stemmer: Removes suffixes and word endings. Although Lucene implements stemming
as a token filter, this component is important enough that we chose to make it
explicitly configurable.</p>
        <sec id="sec-2-4-1">
          <title>2.4.1. Tokenizer Configurations</title>
          <p>Our system supports three tokenizer configurations, each defining how the input text is split into
tokens:
• StandardTokenizerConfig: Uses Lucene’s standard tokenizer, which follows Unicode Text
Segmentation rules. It handles punctuation, digits, and word boundaries in a language-aware
manner.
• LetterTokenizerConfig: Emits tokens consisting only of alphabetic characters. Any
non-letter character (e.g., digits, punctuation) is treated as a delimiter.
• WhitespaceTokenizerConfig: Splits the input text strictly on whitespace characters. It does
not handle or remove punctuation or symbols; those are retained as part of the tokens.</p>
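          <p>The behavioral difference between the letter-based and whitespace-based strategies can be approximated with plain string splitting; this is an illustrative stdlib sketch, not Lucene’s tokenizer classes:</p>

```java
import java.util.Arrays;
import java.util.List;

// Stdlib approximation (not Lucene) of two tokenizer behaviors: a letter
// tokenizer treats every non-letter as a delimiter, while a whitespace
// tokenizer splits only on whitespace and keeps punctuation in the tokens.
public class TokenizerSketch {
    public static List<String> letterTokenize(String text) {
        return Arrays.stream(text.split("[^\\p{L}]+"))
                     .filter(t -> !t.isEmpty())
                     .toList();
    }

    public static List<String> whitespaceTokenize(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                     .filter(t -> !t.isEmpty())
                     .toList();
    }
}
```

          <p>On the input “state-of-the-art, 2022!”, the letter strategy drops digits and punctuation entirely, while the whitespace strategy keeps them attached to the tokens.</p>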
        </sec>
        <sec id="sec-2-4-2">
          <title>2.4.2. Token Filters</title>
          <p>We have utilized Token Filters to expand and modify the token streams of both queries and documents.
These filters are categorized into agnostic and language-specific types, and they are applied after
tokenization.</p>
          <p>Agnostic Token Filters These filters are language-independent and perform general text
normalization or enhancement operations:
• LowerCaseFilterConfig: Converts all tokens to lowercase, ensuring case-insensitive search.
• ASCIIFoldingFilterConfig: Transforms accented and special characters into their closest</p>
          <p>ASCII equivalents.
• TrimFilterConfig: Removes any leading or trailing whitespace around each token.
• LengthFilterConfig: Discards tokens whose character count falls outside a configurable
range.
• RegexFilterConfig: Uses regular expressions to remove or modify tokens based on pattern
matching (e.g., HTML tags or non-Unicode characters).
• SpellCheckerFilterConfig: Applies Lucene’s spell-checking module to suggest corrections
for tokens using the index.</p>
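          <p>The combined effect of a few of these agnostic filters can be sketched as a plain stream pipeline; this illustrates the semantics only and is not Lucene’s TokenFilter chain:</p>

```java
import java.util.List;
import java.util.Locale;

// Illustrative sequence of agnostic filters: lowercase, then trim, then
// discard tokens whose length is outside the [minLen, maxLen] range.
public class AgnosticFilterSketch {
    public static List<String> filter(List<String> tokens, int minLen, int maxLen) {
        return tokens.stream()
                     .map(t -> t.toLowerCase(Locale.ROOT))  // LowerCaseFilter
                     .map(String::strip)                    // TrimFilter
                     .filter(t -> t.length() >= minLen && t.length() <= maxLen) // LengthFilter
                     .toList();
    }
}
```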
        </sec>
        <sec id="sec-2-4-3">
          <title>Language-Specific Token Filters</title>
          <p>These filters handle language-specific linguistic features and are designed for either English or French:
• FrenchStopFilterConfig: Removes common French stopwords such as “le”, “de”, and “et”.
• FrenchElisionFilterConfig: Removes French elisions in contractions (e.g., “l’homme”
becomes “homme”).
• EnglishStopFilterConfig: Removes frequent English stopwords such as “the”, “and”, and
“of”.
• EnglishPossessiveFilterConfig: Removes possessive suffixes (e.g., “company’s” becomes
“company”).</p>
        </sec>
        <sec id="sec-2-4-4">
          <title>2.4.3. Stemming</title>
          <p>Stemming is handled via Lucene’s built-in filters and configured directly within the analyzer. For English,
the system supports minimal stemming, which reduces words to their root forms (e.g., "running" to
"run") while avoiding overly aggressive normalization that might merge distinct terms.</p>
          <p>For French, the system offers a broader range of options. It supports minimal stemming for light
normalization using the FrenchMinimalStemFilter, a more conservative approach through the
FrenchLightStemFilter, and a more aggressive strategy using the FrenchSnowballStemFilter
with the French Snowball algorithm.</p>
        </sec>
        <sec id="sec-2-4-5">
          <title>2.4.4. Word N-grams</title>
          <p>
            In our system, we initially applied Lucene’s ShingleFilter, a TokenFilter that generates n-grams—
sequences of adjacent tokens—to enhance textual representation. This approach aimed to improve recall
by capturing short-range context and enabling the system to match not just isolated terms but also
common word combinations and phrases [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. We applied the ShingleFilter during both indexing
and query processing: at indexing time, to enrich document representations by storing word n-grams;
and at query time, to perform query expansion by appending n-gram variants to the original query
terms.
          </p>
          <p>However, this naive n-gram indexing method revealed significant performance issues. The number
of generated n-grams grows combinatorially with document length, causing both storage overhead and
severe slowdowns during indexing due to the bloated inverted index. To mitigate this inefficiency, we
adopted a refined strategy that avoids indexing all possible n-grams explicitly.</p>
          <p>Instead, the updated system uses Lucene’s phrase query capabilities to dynamically generate n-gram
matches at query time. Rather than storing n-grams in the index, we now tokenize and index only
unigrams but construct PhraseQuery objects from the original query text when searching. This
approach maintains the benefits of contextual matching while reducing index size and indexing time.
By leveraging Lucene’s efficient positional indexing and skipping the storage of redundant n-gram
entries, we achieve comparable retrieval effectiveness with much better performance.</p>
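          <p>The query-time enumeration of word n-grams, each of which would back one PhraseQuery, can be sketched as follows; the class and method names are illustrative, and the size bounds correspond to the configurable minimum and maximum n-gram sizes:</p>

```java
import java.util.ArrayList;
import java.util.List;

// Enumerates the word n-grams of a token list between minSize and maxSize.
// In the real system each phrase would become a Lucene PhraseQuery clause.
public class NGramSketch {
    public static List<String> phrases(List<String> tokens, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int n = minSize; n <= maxSize; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return out;
    }
}
```

          <p>For the tokens [a, b, c] with sizes 2 to 3, this yields the phrases “a b”, “b c”, and “a b c”.</p>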
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Parallel Computing</title>
        <p>
          An essential way to improve the performance of a search engine is not only by optimizing how queries
are executed but also by addressing the time required to index documents [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The efficiency of
both the indexing process and the execution of user queries is critical, particularly as the volume of
documents grows. In our search engine project, we realized that running numerous experiments with
significant hardware constraints required careful attention to the time and resources consumed in these
operations. To mitigate performance bottlenecks, we focused on optimizing both the indexing and
searching processes using parallelism. The following sections detail how parallelism was implemented
to streamline the indexing and searching processes.
        </p>
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Indexing</title>
        <p>2.6.1. Indexer
The indexing stage is responsible for transforming processed documents into an inverted index using
Lucene. This index enables efficient retrieval of relevant documents in response to user queries. The
indexing logic is encapsulated in the Indexer class, which is instantiated at runtime based on external
configuration.</p>
        <p>The Indexer class serves as the core component responsible for the indexing workflow. Its main
function is to iterate through all parsed documents, apply the configured analyzer to the document
content, and write the resulting tokens into a Lucene index.</p>
        <p>
          The indexer operates on a designated input directory and outputs to a target index directory. It is
responsible for several tasks related to indexing, including initializing a Lucene IndexWriter with the
appropriate configuration, processing each document using the provided analyzer, and storing fields
such as the document ID and body content. Additionally, it records term positions and frequencies to
support proximity queries and scoring models like BM25 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The indexer also logs important statistics,
including the number of files processed and the time taken for the indexing process.
        </p>
        <sec id="sec-2-6-1">
          <title>2.6.2. Concurrent Indexing Process</title>
          <p>In this project, parallelism in the indexing process is implemented using a producer-consumer model. The
producer threads are responsible for parsing documents, while the consumer threads handle the indexing
of these parsed documents into a Lucene index. Once a document is parsed into a ParsedDocument
by a producer, it is placed into a shared BlockingQueue, where it will be consumed. For this parallel
indexing system, we use an empirical ratio found to perform well on our machine: 2/5 of the threads
are allocated to document parsing (producers), and 3/5 to document indexing (consumers).</p>
          <p>After all documents are processed, a special POISON marker is used to signal the end of the input
stream and gracefully terminate consumer threads.</p>
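          <p>The producer-consumer handoff and the POISON shutdown protocol can be sketched with a plain BlockingQueue. Here the parsing and indexing work is simulated so the protocol itself is visible; all names are illustrative stand-ins for the system’s classes:</p>

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Producer-consumer pipeline with a poison-pill marker that terminates the
// consumer threads once the input stream is exhausted.
public class PipelineSketch {
    record ParsedDocument(String id) {}
    static final ParsedDocument POISON = new ParsedDocument("__POISON__");

    public static int run(int docs, int consumers) {
        BlockingQueue<ParsedDocument> queue = new ArrayBlockingQueue<>(64);
        AtomicInteger indexed = new AtomicInteger();
        Thread[] workers = new Thread[consumers];
        try {
            for (int i = 0; i < consumers; i++) {
                workers[i] = new Thread(() -> {
                    try {
                        while (true) {
                            ParsedDocument doc = queue.take();
                            if (doc == POISON) {
                                queue.put(POISON); // pass the pill to sibling consumers
                                return;
                            }
                            indexed.incrementAndGet(); // stand-in for indexing the document
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                workers[i].start();
            }
            for (int d = 0; d < docs; d++) queue.put(new ParsedDocument("doc-" + d));
            queue.put(POISON); // signal end of the input stream
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return indexed.get();
    }
}
```

          <p>With the 2/5 : 3/5 split described above, one would start Math.max(1, threads * 2 / 5) producers and allocate the remaining threads as consumers.</p>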
          <p>During these experiments, we also identified that one of the most significant bottlenecks was the
commit-to-disk phase during indexing. On systems equipped with sufficient RAM (at least 32 GB in our
tests), we enabled an in-memory indexing mode where the index was never written to disk. Instead,
it remained entirely in RAM throughout the process. This approach reduced the total indexing time
by nearly half, enabling a much higher number of experiments within the same time window. The
decision to use disk-based or in-memory indexing remains configurable through the system’s JSON
configuration.</p>
        </sec>
        <sec id="sec-2-6-2">
          <title>2.6.3. Index Field Configuration</title>
          <p>The indexer writes each document’s content into a dedicated Lucene field, typically named BODY_FIELD.
This field is configured to store term frequency and position information, which is used for ranked
retrieval and phrase queries. Additionally, it can optionally store the original body text, which may be
useful in downstream components such as rerankers.</p>
        </sec>
      </sec>
      <sec id="sec-2-7">
        <title>2.7. Searching</title>
        <p>The searching stage is responsible for executing user queries over the previously created Lucene index
and retrieving the most relevant documents. This process is implemented through the Searcher class,
which orchestrates multithreaded query execution and writes ranked results in TREC run format.</p>
        <p>The Searcher class serves as the main retrieval engine and is instantiated with several components.
These include a Lucene IndexReader for accessing the index, a similarity model (such as BM25) to
score documents, and a query parser that generates Lucene queries from raw query text, as discussed
in Section 2.2.</p>
        <sec id="sec-2-7-1">
          <title>2.7.1. Concurrent Searching Process</title>
          <p>Initially, the search process followed a producer-consumer architecture to parallelize query processing
across multiple threads. The QueryProducerTask ran in a dedicated thread that parsed and converted
queries into Lucene-compatible objects wrapped in QueryWrapper instances. These were pushed into
a shared blocking queue. Multiple QuerySearchTask threads then consumed the queries, executed
searches via Lucene’s IndexSearcher, and handled result formatting and output. A poison-pill
mechanism was employed to signal the end of the query stream.</p>
          <p>However, as the complexity of query generation increased—particularly with the addition of temporal
reasoning and semantic rewriting—the producer-consumer model introduced subtle concurrency bugs
that were difficult to debug and resolve. To simplify the control flow and reduce synchronization
overhead, we migrated to a single-task parallel model. In this design, a single orchestrator task internally
handles both query parsing and execution, distributing work across threads without the need for an
intermediate queue.</p>
          <p>Despite the structural simplification, we observed no drop in performance. CPU utilization remained at
100% throughout the search process, confirming that all available threads were effectively leveraged. This
refactoring improved maintainability while preserving high throughput during the query evaluation.</p>
        </sec>
        <sec id="sec-2-7-2">
          <title>2.7.2. Similarity Model</title>
          <p>
            The system uses Lucene’s BM25 [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] implementation as the default similarity model. Key parameters
such as k1 (term frequency saturation) and b (length normalization) are configurable.
          </p>
          <p>Additionally, the retrieval logic includes a configurable thresholding mechanism to filter out
low-quality results. Although the system may retrieve up to a fixed number of documents (e.g., 100), it
evaluates each result relative to the top-ranked document. If a candidate’s score falls below a configurable
percentage of the highest score (e.g., less than 50% of the top score), it is discarded. This helps eliminate
documents that are technically within the result limit but are likely to be irrelevant, improving the
overall precision of the final ranked list.</p>
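          <p>The relative thresholding step can be sketched as a simple filter over the ranked score list; this is an illustration of the rule, while the real system operates on Lucene hits:</p>

```java
import java.util.List;

// Keeps only results scoring at least `threshold` times the top-ranked
// document's score. Assumes scores arrive in descending rank order,
// as Lucene returns them.
public class ScoreThresholdSketch {
    public static List<Double> keep(List<Double> scores, double threshold) {
        if (scores.isEmpty()) return scores;
        double cutoff = scores.get(0) * threshold;
        return scores.stream().filter(s -> s >= cutoff).toList();
    }
}
```

          <p>With a threshold of 0.5, a hit scoring 4.9 against a top score of 10 is discarded even if it falls within the retrieval limit.</p>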
        </sec>
      </sec>
      <sec id="sec-2-8">
        <title>2.8. Hyperparameter Optimization with Optuna</title>
        <p>
          To explore the space of possible configurations and improve the overall performance of our retrieval
system, we used Optuna [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], a Python library for hyperparameter optimization designed to automatically
search for good parameter combinations. We chose it because it is easy to use, highly customizable, and
fits naturally into our architecture where the system behavior is driven by a set of JSON configuration
files.
        </p>
        <p>In our setup, shown in Figure 2, we treat each run of the search engine as a trial. For every trial,
Optuna suggests a combination of parameters that define how the system behaves during indexing and
search. These include:
• BM25_PARAM_k1 → controls term frequency saturation in BM25.
• BM25_PARAM_b → controls length normalization in BM25.
• STEM_FILTER_TYPE → the type of stemming applied during indexing (e.g., french_minimal).
• STOP_FILTER_TYPE → stopword filtering strategy (e.g., built-in or generic).
• STOP_FILTER_FILEPATH → external stopword list file used if applicable.
• LENGTH_MIN_LENGTH → minimum term length allowed in the index.
• LENGTH_MAX_LENGTH → maximum term length allowed in the index.
• TEMPORAL_BOOST → weight applied to the temporal similarity score.
• NGRAM_BOOST → weight applied to the n-gram matching score.
• SCORE_THRESHOLD → minimum score a document must have to be considered relevant.
• MIN_NGRAM_SIZE → minimum size of n-grams used in the query expansion.
• MAX_NGRAM_SIZE → maximum size of n-grams used in the query expansion.
• MAX_DOC_RETRIEVED → maximum number of documents retrieved per query.
• PROXIMITY_RERANKER_MAX_SLOP → maximum distance two words may be apart for proximity
reranking.
• SPELLCHECKER_NUMBER_OF_SUGGESTIONS → number of spell correction suggestions
considered.
• SPELLCHECKER_MIN_TERM_LENGTH → minimum length of a word to be spell-checked.
• BM25_WEIGHT → the weight of the BM25 score for document retrieval.
• PROXIMITY_WEIGHT → the weight of proximity search for document retrieval.
These parameters are inserted into two template files, indexer.json and searcher.json, replacing
placeholder values.</p>
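        <p>The substitution of Optuna-suggested values into the JSON templates could look like the following sketch; the ${NAME} placeholder syntax is an assumption for illustration, not necessarily the real templates’ format:</p>

```java
import java.util.Map;

// Replaces ${NAME} placeholders in a JSON template with the parameter
// values suggested for the current trial.
public class TemplateFill {
    public static String fill(String template, Map<String, String> params) {
        for (Map.Entry<String, String> e : params.entrySet()) {
            template = template.replace("${" + e.getKey() + "}", e.getValue());
        }
        return template;
    }
}
```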
        <p>Once the JSON files are generated, the system launches the search engine with these configurations
across snapshots. After each run, the results are evaluated using trec_eval, and a custom score is
computed based on the output. Optuna also takes care of managing the history of all experiments: it
keeps track of which parameter combinations have been tried, how well they performed, and which
trial produced the best result so far. More importantly, it supports an early stopping mechanism called
pruning, which allows the system to interrupt trials that appear unpromising before they complete all
evaluations.</p>
        <p>To guide this optimization, we compute a custom score that combines multiple aspects of performance. First,
we calculate the average nDCG over the document snapshots evaluated so far, to capture overall
effectiveness. Then, we assess the relative drop in performance between consecutive snapshots to
penalize instability over time. These components are combined into a single score that rewards both
high effectiveness and stable behavior.</p>
        <p>Formally, the score S used during optimization is defined as:</p>
        <p>S = α · mean(nDCG) − (1 − α) · std(nDCG)
where α ∈ [0, 1] is a tunable parameter (configured via the .env file) that controls the trade-off
between average effectiveness and result stability.</p>
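        <p>The score can be computed directly from the per-snapshot nDCG values; a minimal sketch, assuming the population standard deviation:</p>

```java
import java.util.List;

// Computes S = alpha * mean(nDCG) - (1 - alpha) * std(nDCG) over the
// per-snapshot nDCG values (population standard deviation assumed).
public class TrialScore {
    public static double score(List<Double> ndcg, double alpha) {
        double mean = ndcg.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double var = ndcg.stream().mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0.0);
        return alpha * mean - (1 - alpha) * Math.sqrt(var);
    }
}
```

        <p>With α = 1 the score reduces to the mean nDCG; with α = 0 it rewards only stability across snapshots.</p>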
        <p>As the optimization proceeds, Optuna continuously explores the parameter space and learns which
configurations are more effective. Figure 3 shows the progression of average nDCG values across
trials, highlighting how the system gradually converges toward more effective and stable configurations.
It’s worth noting that, before the system reaches its optimal and stable state, there is an intermediate
period of instability, likely caused by Optuna’s extensive exploration of the parameter space.</p>
      </sec>
      <sec id="sec-2-9">
        <title>2.9. Discarded Approaches</title>
        <p>Throughout the development process, we investigated several techniques that initially appeared
promising for improving system performance. However, due to technical limitations, inadequate lexical
coverage, or integration complexity, some of these approaches did not yield the expected results. The
following sections briefly describe these attempts and the reasons why they were ultimately abandoned.</p>
        <sec id="sec-2-9-1">
          <title>2.9.1. Semantic Expansion with WordNet</title>
          <p>
            To enhance semantic understanding during query processing, we explored the use of WordNet [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], a
large lexical database of English developed at Princeton University. WordNet organizes nouns, verbs,
adjectives, and adverbs into sets of cognitive synonyms (synsets), each expressing a distinct concept. Its
network structure of semantic relationships—such as synonymy, hypernymy, and meronymy—makes it
a valuable resource for query expansion and contextual matching in information retrieval systems.</p>
          <p>Goal and Integration Attempt Our goal was to use WordNet to expand query terms with
semantically related alternatives (e.g., synonyms, hypernyms), ideally improving recall for queries that
might otherwise be too narrow or under-specified. Integration into our Lucene-based system was
envisioned through a token filter pipeline that would enrich tokens at query time using synsets retrieved
from WordNet. We implemented a WordnetTokenFilterConfig component and began testing this
expansion strategy.
          </p>
          <p>To access WordNet programmatically, we relied on extJWNL [9], a Java API that facilitates interaction
with WordNet dictionaries and synsets. It provided a convenient way to retrieve semantic relations for
English terms and integrate them into the Lucene pipeline.</p>
          <p>However, our dataset was predominantly in French, so we sought a compatible lexical resource in that
language. The most promising candidate was WOLF (WordNet Libre du Français) [10], an open-source
French WordNet project built to mirror the structure of the original Princeton WordNet.</p>
          <p>Limitations and Abandonment Unfortunately, WOLF posed several challenges. First, its format
was not compatible with the most common Java-based WordNet access libraries such as extJWNL.
Adapting the data for these tools would have required extensive manual transformation or building
custom parsers. Furthermore, WOLF’s coverage and structural consistency were limited compared to
the English WordNet. Many French terms lacked synsets or had only shallow hierarchies of related
concepts.</p>
          <p>After multiple attempts to preprocess and align the WOLF dataset to our tokenizer and filter pipeline,
we were unable to achieve a working integration. No synonym expansion was successfully applied in
practice.</p>
          <p>Conclusion Given these technical barriers and the lack of tooling for French WordNet resources,
we ultimately decided to abandon this route. Nonetheless, we believe that WordNet-based semantic
expansion remains a promising strategy for IR, particularly in monolingual English scenarios or when
higher-quality lexical ontologies are available for the target language.</p>
        </sec>
        <sec id="sec-2-9-2">
          <title>2.9.2. Query Expansion via Synonym Dictionaries</title>
          <p>In an attempt to improve recall on short or under-specified queries, we experimented with a query
expansion strategy based on synonym substitution. The idea was to extend queries with semantically
related terms, increasing the likelihood of retrieving documents that used alternative wordings. This
approach is well-established in information retrieval, particularly when dealing with sparse queries or
lexical variability between user language and document language.</p>
          <p>Dictionary-Based Expansion To implement this idea, we adopted an open-source synonym
dictionary from a GitHub repository [11]. Specifically, we extracted and converted the core data contained in
the dictionary.go file, which maps base terms to lists of synonyms. We reformatted this data into a
format compatible with Lucene’s SynonymFilter, allowing integration into our analyzer pipeline.</p>
          <p>The expansion logic was applied selectively: if a query was deemed too short (e.g., less than a
configurable number of tokens), we automatically augmented it by adding one synonym per word,
based on dictionary availability. If the query remained short even after the first pass, additional rounds
of expansion were performed iteratively, appending further synonyms where possible.</p>
          <p>Evaluation and Limitations Although this strategy was theoretically sound and straightforward to
implement, it failed to produce the expected improvements in retrieval quality. Upon closer inspection,
we found that the synonym pairs in the dictionary were often weakly related or even misleading. Many
base terms had no meaningful synonyms, and in other cases the listed synonyms introduced semantic
drift, retrieving documents that were topically unrelated to the original query intent.</p>
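          <p>The selective expansion loop described earlier (augment short queries round by round, appending one synonym per original word where the dictionary has one) can be sketched as follows; the dictionary and names are illustrative:</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// While the query has fewer than minTokens tokens, append one synonym per
// original word per round, stopping when no dictionary entries remain.
public class SynonymExpansionSketch {
    public static List<String> expand(List<String> query, Map<String, List<String>> dict, int minTokens) {
        List<String> expanded = new ArrayList<>(query);
        int round = 0;
        while (expanded.size() < minTokens) {
            boolean added = false;
            for (String term : query) {
                List<String> syns = dict.getOrDefault(term, List.of());
                if (round < syns.size()) {   // one new synonym per word per round
                    expanded.add(syns.get(round));
                    added = true;
                }
            }
            if (!added) break;               // no further synonyms available
            round++;
        }
        return expanded;
    }
}
```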
          <p>Rather than boosting recall, the expansion often diluted query specificity, resulting in lower precision.
This outcome was disappointing, especially considering the added complexity and overhead introduced
in the query processing pipeline.</p>
          <p>Conclusion We ultimately decided to abandon this synonym-based expansion strategy. However,
our conclusion is not a rejection of the approach itself, but rather a reflection of the limited quality
of the available lexical resource. We remain confident that, given a richer and more context-aware
synonym dictionary, ideally one derived from a semantic model or manually curated lexicon, query
expansion could be an effective tool in improving retrieval performance.</p>
        </sec>
        <sec id="sec-2-9-3">
          <title>2.9.3. Date Normalization with Duckling</title>
          <p>To handle temporal information in both documents and queries, we initially integrated Duckling [12],
a library for recognizing and normalizing date expressions. The goal was to better align documents
and queries referring to the same time periods, even if expressed differently (e.g., “last summer” vs. “July
2022”).</p>
          <p>Duckling, developed by Facebook AI Research, is an open-source system designed to extract structured
data such as dates, durations, numbers, and quantities from natural language text. One of its key
advantages is its multilingual capability, including support for English and French, which made it
attractive for our multilingual setup. Another important factor in its initial selection was its availability
as a self-contained, Dockerized microservice [13], allowing easy integration into our Java-based system
via HTTP APIs returning structured JSON.</p>
          <p>Architecture and Optimization Our first implementation used a microservice architecture:
Duckling ran locally as a Docker container exposing an HTTP API. During indexing, each document was sent
individually to Duckling, which returned all detected temporal expressions. From these, we computed
the median timestamp and stored it in a dedicated Lucene field. At query time, the same process
was applied: queries were passed to Duckling, and if temporal expressions were found, the resulting
normalized timestamp was used to boost matching documents occurring around the same time.</p>
          <p>Although conceptually elegant, this solution rapidly became a major performance bottleneck,
particularly during indexing, where millions of HTTP calls to Duckling introduced significant overhead.
To address this, we moved date extraction to a preprocessing step. We extracted timestamps in batch
mode and built a map of doc_id → median_timestamp, which was then passed to the indexer. This
drastically improved performance and allowed timestamp reuse across runs, as long as the documents
remained unchanged.</p>
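<p>The reduction from extracted timestamps to the single stored value can be sketched as follows; class and field names are illustrative, not our actual code:</p>

```java
import java.util.*;

/** Sketch of the batch date-normalization step: the timestamps Duckling
 *  extracted from one document are reduced to a single median value, which
 *  is what we stored in a dedicated Lucene field. Names are illustrative. */
public class MedianTimestamp {

    /** Returns the median of the extracted epoch-second timestamps. */
    static long median(List<Long> timestamps) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        int n = sorted.size();
        return (n % 2 == 1)
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2;
    }

    public static void main(String[] args) {
        // doc_id -> median_timestamp map built in batch mode before indexing
        Map<String, Long> docToMedian = new HashMap<>();
        docToMedian.put("doc42", median(List.of(1656633600L, 1659312000L, 1661990400L)));
        System.out.println(docToMedian.get("doc42")); // prints 1659312000
    }
}
```

<p>Precomputing this map once and reusing it across runs is what removed the per-document HTTP overhead.</p>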
          <p>Instability and Abandonment Despite these optimizations, we encountered stability issues. In
particular, some queries containing a high density of closely spaced temporal expressions would
cause Duckling to crash, terminating the processing thread. After inspecting the
problem, we observed that Duckling consistently failed when parsing certain complex queries with
many adjacent or overlapping temporal entities.</p>
          <p>We attempted to bypass the Docker abstraction and compiled Duckling natively from source using
Haskell, but the crashes persisted. At this point, we conducted a statistical analysis of the dataset and
found that only approximately 2% of all queries actually contained temporal expressions. Given this low
impact on the overall retrieval process and the persistent technical issues, we concluded that Duckling’s
inclusion did not justify its cost in terms of system complexity and stability.</p>
          <p>Consequently, we decided to remove Duckling from our system permanently, even though temporal
alignment remains a potentially relevant technique for an IR system.</p>
        </sec>
        <sec id="sec-2-9-4">
          <title>2.9.4. Neural Reranking with CamemBERT</title>
          <p>In an effort to improve result ranking quality beyond traditional lexical matching, we experimented
with a reranking phase based on CamemBERT [14]. CamemBERT is a transformer-based language
model specifically trained for the French language. It is based on the RoBERTa architecture and was
trained on a 138GB corpus composed of high-quality French text sources, including OSCAR, CCNet,
and Wikipedia. Its monolingual training makes it particularly well-suited for semantic understanding
in French retrieval tasks.</p>
          <p>Usage and Implementation We employed CamemBERT to rerank the top-k documents retrieved
via BM25. For each query-document pair, embeddings were computed and cosine similarity was used
to reorder documents by semantic proximity.</p>
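<p>The reordering step can be sketched as below; the embedding values are placeholders for what CamemBERT would produce via DJL:</p>

```java
import java.util.*;

/** Sketch of the cosine-similarity reranking step: given a query embedding and
 *  candidate document embeddings, reorder the candidates by semantic proximity.
 *  The vectors here are illustrative placeholders. */
public class CosineReranker {

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Reorders document ids by descending cosine similarity to the query. */
    static List<String> rerank(float[] query, Map<String, float[]> docs) {
        List<String> ids = new ArrayList<>(docs.keySet());
        ids.sort(Comparator.comparingDouble(
                (String id) -> cosine(query, docs.get(id))).reversed());
        return ids;
    }

    public static void main(String[] args) {
        float[] q = {1f, 0f, 1f};
        Map<String, float[]> docs = new HashMap<>();
        docs.put("d1", new float[]{0f, 1f, 0f});   // orthogonal to the query
        docs.put("d2", new float[]{1f, 0f, 0.9f}); // close to the query
        System.out.println(rerank(q, docs)); // d2 ranked first
    }
}
```

<p>Only the ordering logic is shown; producing the embeddings is the expensive CPU-bound part discussed below.</p>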
          <p>To run the model within our Java-based search system, we used the Deep Java Library (DJL)
[15], a high-level, engine-agnostic deep learning framework for Java. DJL allows direct integration of
pre-trained neural models into Java applications, and supports backends like PyTorch, TensorFlow, and
ONNX. However, DJL expects models to be in the PyTorch .pt format for PyTorch-based inference.</p>
          <p>Since the official CamemBERT model available from Hugging Face [16] is provided in the Transformers
format (with separate configuration and weights files), we had to download it manually and convert
it into a serialized .pt format compatible with DJL. A Python script using the transformers and
torch libraries handled the conversion and model tracing, enabling us to load it directly via DJL at
runtime.</p>
          <p>Performance Limitations and Abandonment While the reranking phase showed improvements
in output quality, particularly when initial BM25 scores failed to capture deeper semantic relevance, the
computational overhead was prohibitive. In the absence of a GPU, all inference operations ran on CPU,
resulting in severe performance degradation. Reranking even a small batch of candidate documents
(e.g., top-10) per query led to a time increase of approximately 8000–10000% compared to the baseline
BM25-only pipeline. Combined with the fact that a single snapshot consisted of approximately
75,000 queries, we concluded that under these conditions the total runtime for a full evaluation became
unmanageable.</p>
          <p>Given that our system was designed with performance and large-scale tunability in mind, we
concluded that this approach was not viable within our project’s constraints. Despite its potential
benefits in terms of ranking quality, the cost in execution time was too high, and we ultimately decided
to discard the BERT-based reranking phase from the final version of the system.</p>
          <p>We plan to revisit this direction in the future under better hardware conditions, as transformer-based
reranking remains one of the most effective techniques for semantic matching.</p>
        </sec>
      </sec>
      <sec id="sec-2-10">
        <title>2.10. System Configurability</title>
        <p>A key design goal of our system was to ensure that all behavior-critical decisions could be modified
without altering the Java source code. To this end, we encapsulated every tunable subsystem within
a serializable configuration object, organized under the de.ir_lab.longeval25.config package
hierarchy.</p>
        <p>Each configuration object implements a corresponding factory interface and is annotated to support
deserialization via Jackson [17]. This allows the appropriate Java classes to be instantiated at runtime
based solely on a type field in a JSON file.</p>
        <p>At runtime, the main driver reads a user-specified configuration file (e.g., searcher.json) from the
folder provided via the command line. This JSON file is then parsed using a shared ObjectMapper,
which deserializes it into the corresponding configuration object.</p>
        <p>During deserialization, runtime dependencies, such as command-line Parameters, are injected
automatically using the @JacksonInject annotation. Once the configuration object is fully constructed,
its toRuntime() method is called to produce the actual working component, ready for execution.</p>
        <p>This level of configurability was fundamental to our workflow, as it enabled integration with automatic
hyperparameter tuning tools like Optuna. Thanks to the JSON-based setup, experiments could be rapidly
iterated without any need for recompilation or repackaging of the system. For example, switching from
one type of stopword list to another (e.g., from a minimal to an aggressive set) requires only a one-line
change in the configuration JSON file.</p>
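<p>As an illustration of this mechanism, a searcher configuration might look like the following; the field names are invented for the example and do not reproduce our exact schema:</p>

```json
{
  "type": "bm25Searcher",
  "similarity": { "type": "bm25", "k1": 0.9, "b": 0.85 },
  "analyzer": {
    "type": "frenchAnalyzer",
    "stemmer": "light",
    "stopwords": "lucene-default"
  }
}
```

<p>Jackson resolves each type field to the matching configuration class, and toRuntime() then turns the deserialized object into the working component.</p>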
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <p>The experimental conditions under which our information retrieval system was developed, deployed,
and evaluated are as follows:
• Used Collections: All experiments were conducted using the LongEval CLEF 2025 Lab dataset
[18], which included documents in various formats (such as .json, .trec, etc.), queries in plain
text format, and the corresponding relevance judgement files.
• Evaluation Measures: System performance was assessed using standard information retrieval
metrics included in the trec_eval tool.
• Source Code Repository: The full source code is available at: https://bitbucket.org/
upd-dei-stud-prj/seupd2425-basette
• Hardware Used for Experiments: All experiments were executed on a high-performance
server with the following specifications:
– CPU: AMD Ryzen 9 5950X (16 cores, 32 threads)
– RAM: 128 GB DDR4
– Storage: 2 × 4 TB NVMe SSDs
– GPU: Not utilized
– Operating System: Ubuntu 24.04.2 LTS
• Software Environment: The system was run with the following environment:
– Java 17 (OpenJDK)
– Maven 3.8.6
– Python 3.11 (for document formatting utilities)
– Lucene 9.0.0
A Docker container was used to hold both the IR system and an image of a LaTeX library for
compiling this report.
• Run Procedure: For detailed information about the steps required to run the project please refer
to the readme.md file in the BitBucket repository.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Training</title>
      <p>In this section, we present and analyze the results from various experiments conducted to optimize the
performance of our system. A primary focus of these experiments was to identify the best
configuration of system components by adjusting various hyperparameters. The process of fine-tuning these
parameters, specifically using Optuna, allowed us to explore the vast parameter space and determine
the most effective settings for our retrieval system.</p>
      <p>The system architecture has distinct configurations for both the indexer and the searcher. Both
configurations include placeholders for the hyperparameters, which Optuna dynamically selects to
optimize the performance of the system. The results of these tests are presented in the following
sections.</p>
      <p>For the purposes of training our system, we used the CLEF collection snapshots shown in Table 1.
The results and plots presented in the following sections are summarized in Tables 2 and 3. For clarity
and conciseness, highly similar configurations have been omitted.</p>
      <sec id="sec-4-1">
        <title>4.1. Indexing</title>
        <p>The performance results of our indexing process, as shown in Table 4, highlight the significant
improvements achieved through multithreading. When using a single-threaded approach, indexing a
12 GB document dataset took approximately 1800 seconds, whereas with multithreading, the same task
was completed in just 176 seconds, demonstrating a substantial reduction in indexing time.</p>
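<p>The structure of the multithreaded indexer can be sketched with a standard worker pool; here the per-document work is simulated by token counting, standing in for the analysis and IndexWriter.addDocument calls (Lucene's IndexWriter is thread-safe, so workers can share one writer):</p>

```java
import java.util.*;
import java.util.concurrent.*;

/** Worker-pool sketch of multithreaded indexing: the collection is split into
 *  batches and each batch is processed in parallel. The "indexing" here just
 *  counts tokens; in the real system each worker feeds a shared IndexWriter. */
public class ParallelIndexing {

    static long indexAll(List<List<String>> batches, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (List<String> batch : batches) {
                results.add(pool.submit(() ->
                        batch.stream().mapToLong(doc -> (long) doc.split("\\s+").length).sum()));
            }
            long total = 0;
            for (Future<Long> f : results) total += f.get(); // wait for every worker
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> batches = List.of(
                List.of("le chat dort", "la crise financière"),
                List.of("liberté de parole"));
        System.out.println(indexAll(batches, 2)); // prints 9
    }
}
```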
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Stemming Strategy</title>
        <p>In our system, we evaluated three stemming strategies tailored for the French language: a conservative
approach through the FrenchMinimalStemFilter, a more balanced method via the FrenchLight
StemFilter, and a more aggressive strategy using the FrenchSnowballStemFilter, which is based
on the Snowball algorithm.</p>
        <p>As shown in Table 2, Optuna consistently selected the FrenchLightStemFilter as the most
effective stemming strategy. It outperformed both the minimal and snowball approaches across the
different evaluation metrics, improving retrieval performance without introducing the risks of under-
or over-stemming commonly associated with the other two methods.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Stopword Filtering</title>
        <p>Stopword filtering is a preprocessing step in text retrieval systems, aimed at removing high-frequency
words that typically carry low semantic content.</p>
        <p>In our setup, we compared nine different stopword lists for French, including several curated sets
from the open-source repository stopwords-iso [19], which offers community-maintained stopword
lists in multiple formats and dialectal variations.</p>
        <p>Optuna selected the default French stopword list provided by Lucene, accessed via FrenchAnalyzer.
getDefaultStopSet() [20], as the most effective configuration. This built-in list, integrated directly
in the Apache Lucene library, appears to provide a balanced and domain-independent set of common
French stopwords that generalizes well across document collections.</p>
        <p>As shown in Table 3, this default set outperformed all alternative configurations in terms of retrieval
performance, highlighting the reliability of the default Lucene stopword strategy when no
domain-specific list is clearly superior.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. BM25 Similarity Parameter Tuning</title>
        <p>Tuning of Parameter b The parameter b in the BM25 similarity function controls the extent to
which document length is normalized. A value of b close to 0 reduces the influence of document length,
while a value close to 1 fully normalizes scores based on length differences between documents.</p>
        <p>As shown in Figure 4, Optuna consistently converged toward values of b around 0.85 during the
optimization process. This result aligns well with theoretical expectations in Information Retrieval
literature, where b values between 0.75 and 0.9 are commonly found to perform well across diverse
corpora [21]. In our case, this setting suggests that moderate document length normalization is beneficial
for our collections, allowing longer documents to remain competitive without overwhelming shorter
ones.
Tuning of Parameter k1 The parameter k1 regulates the influence of term frequency in the BM25
scoring function. Higher values increase the contribution of frequently occurring terms in the document,
typically leading to better discrimination between relevant and non-relevant documents.</p>
        <p>Unexpectedly, as illustrated in Figure 5, Optuna identified an optimal value for k1 around 0.95 – lower
than the commonly cited optimal range of 1.2 to 2.0. This result might seem counterintuitive at first.
One possible explanation lies in the nature of our collection and preprocessing pipeline: stemming,
stopword filtering, and n-gram boosting might already amplify term frequency signals or reduce noise,
making a lower k1 sufficient or even preferable. Another contributing factor could be the presence of
short, focused documents or queries, where overly aggressive TF scaling leads to overfitting on frequent
terms.</p>
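<p>To make the roles of the two parameters concrete, the textbook BM25 term score can be evaluated directly. This is the standard formulation, not Lucene's exact implementation (Lucene 9 encodes document lengths lossily), and the term statistics are invented for illustration:</p>

```java
/** Illustrative BM25 term score, showing how k1 scales term-frequency
 *  saturation and b scales document-length normalization. */
public class Bm25Demo {

    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    static double score(double tf, double docLen, double avgDocLen,
                        long docCount, long docFreq, double k1, double b) {
        double norm = k1 * (1 - b + b * docLen / avgDocLen); // length normalization
        return idf(docCount, docFreq) * tf * (k1 + 1) / (tf + norm);
    }

    public static void main(String[] args) {
        // Same invented term statistics, tuned vs. textbook-default parameters:
        double tuned    = score(3, 120, 100, 1_000_000, 5_000, 0.95, 0.85);
        double defaults = score(3, 120, 100, 1_000_000, 5_000, 1.2, 0.75);
        System.out.printf("tuned=%.3f default=%.3f%n", tuned, defaults);
    }
}
```

<p>Raising b penalizes the longer-than-average document more, while raising k1 lets a repeated term contribute more before saturating.</p>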
        <sec id="sec-4-4-1">
          <title>Impact of BM25 Parameters on Final Performance</title>
          <p>Ultimately, the BM25 parameters had a strong influence on the final retrieval quality, as BM25 remains the backbone of our ranking system. As shown
in Figure 6, where lighter (yellowish) colors indicate higher performance values, the combination of k1
= 0.9 and b = 0.85 emerged as the most effective configuration. These 3D graphs summarize the
joint effect of both parameters across the collections.</p>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Word N-grams</title>
        <p>Word n-grams are contiguous sequences of n items, typically words, extracted from text, commonly used
in natural language processing tasks to capture local contextual patterns. They offer a balance between
simplicity and effectiveness, making them valuable for applications such as query expansion,
autocompletion, spell correction, and ranking. In information retrieval, n-grams help identify multi-word
expressions and improve matching between queries and documents.</p>
        <p>Minimum Word N-gram Size The parameter MIN_NGRAM_SIZE controls the shortest sequence of
tokens considered during indexing and query expansion. Lower values (e.g., 1 or 2) focus on very short
phrases, which may increase recall but can introduce noise. As shown in Figure 7, Optuna identified
that a minimum size of 2 yields the best performance. This setting provides a good compromise: it
captures essential short phrases without being overly sensitive to isolated terms or subword tokens.
Maximum N-gram Size The parameter MAX_NGRAM_SIZE sets the upper bound on the length
of word n-grams to be used. Larger values can help detect meaningful multi-word expressions (e.g.,
“freedom of speech” or “climate change”), but they also increase the number of candidate terms, slowing
down indexing and introducing sparsity. Figure 8 shows the behavior of this parameter across trials:
Optuna found that a value of 3 consistently delivered strong results, capturing useful short phrases
without the overhead of longer, rarer n-grams.</p>
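<p>The extraction itself (done by the analysis chain in our system; spelled out here on a plain token list) with the tuned bounds looks like this:</p>

```java
import java.util.*;

/** Sketch of word n-gram (shingle) extraction with the tuned bounds
 *  MIN_NGRAM_SIZE = 2 and MAX_NGRAM_SIZE = 3. */
public class WordNgrams {

    static List<String> ngrams(List<String> tokens, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int n = min; n <= max; n++) {             // each n-gram length
            for (int i = 0; i + n <= tokens.size(); i++) { // each start position
                out.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams(List.of("freedom", "of", "speech"), 2, 3));
        // prints [freedom of, of speech, freedom of speech]
    }
}
```

<p>With these bounds, each starting position contributes at most one bigram and one trigram, which keeps index growth moderate.</p>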
        <sec id="sec-4-5-1">
          <title>Combined Effect on Precision and Recall</title>
          <p>To assess the combined effect of MIN_NGRAM_SIZE and MAX_NGRAM_SIZE, we analyzed the precision-recall behavior of the system under various
configurations. Figure 9 presents the resulting PR curves from TREC evaluation. The optimal configuration,
MIN_NGRAM_SIZE = 2 and MAX_NGRAM_SIZE = 3, emerges clearly as it yields the best balance
between precision and recall.</p>
          <p>N-gram Boost in Query Scoring Beyond indexing, n-grams also influence the scoring phase through
a dedicated boosting mechanism: when an n-gram extracted from the query matches an indexed phrase
in a document, the score of that document is increased proportionally to a weight parameter called
NGRAM_BOOST.</p>
          <p>Optuna identified that a relatively high value, around 2.75, was optimal for this boost factor. This
suggests that multi-word expressions play a significant role in distinguishing relevant documents within
our collections. By giving extra weight to n-gram matches, the system prioritizes documents that
preserve query intent more precisely.</p>
          <p>This outcome also aligns with linguistic intuition: key phrases (e.g., “data protection act”, “financial
crisis”) often carry more semantic value than individual terms.</p>
          <p>As shown in Figure 10, Optuna trials converged toward this higher value, reinforcing the importance
of phrase-level matching.</p>
        </sec>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Length Filter</title>
        <p>Length filtering is a fundamental preprocessing step in many Information Retrieval systems. Its goal is
to eliminate tokens that are either too short or too long, which are often semantically uninformative or
problematic for indexing.</p>
        <p>Very short tokens (e.g., one or two characters) often correspond to punctuation, prepositions, articles,
or incomplete terms resulting from tokenization errors. Including them increases index size without
improving retrieval quality and can introduce noise during matching. On the other end of the spectrum,
extremely long tokens are typically rare compound words, malformed terms, or noisy artifacts, especially
in user-generated content, and may negatively affect performance due to low frequency and high
indexing cost [22].</p>
        <p>To address this, our system applies a length filter both during indexing and query processing, removing
tokens outside a predefined length interval, defined by two parameters: LENGTH_MIN_LENGTH and
LENGTH_MAX_LENGTH.</p>
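<p>Lucene provides this operation as LengthFilter in the analysis chain; its effect is equivalent to the following sketch with the tuned bounds:</p>

```java
import java.util.*;
import java.util.stream.*;

/** Sketch of the length filter applied at indexing and query time: tokens
 *  outside [LENGTH_MIN_LENGTH, LENGTH_MAX_LENGTH] are dropped. */
public class LengthFilterDemo {

    static final int LENGTH_MIN_LENGTH = 2;
    static final int LENGTH_MAX_LENGTH = 17;

    static List<String> filter(List<String> tokens) {
        return tokens.stream()
                .filter(t -> t.length() >= LENGTH_MIN_LENGTH
                          && t.length() <= LENGTH_MAX_LENGTH)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "a" is too short, the 25-character compound is too long
        List<String> tokens = List.of("a", "loi", "anticonstitutionnellement", "données");
        System.out.println(filter(tokens)); // prints [loi, données]
    }
}
```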
        <p>Minimum Token Length The parameter LENGTH_MIN_LENGTH sets the lower bound on the length
of tokens to be retained. As shown in Figure 11, Optuna trials indicate that setting this value around 2
offers the best trade-off between eliminating noise and preserving useful query terms. This prevents
overly generic tokens like “a”, “on”, or “at” from affecting the scoring.</p>
        <p>Maximum Token Length The parameter LENGTH_MAX_LENGTH defines the upper bound of
acceptable token length. Words that exceed this length are typically rare or malformed, and indexing them can
result in unnecessary computational overhead. Figure 12 shows how Optuna explored this parameter,
ultimately converging around a value of 17, which effectively removes outliers without discarding
meaningful longer terms.</p>
        <p>Joint Impact on Retrieval Performance The combined effect of the minimum and maximum
token length parameters was analyzed using 3D visualizations of retrieval performance. Figure 13
illustrates how different combinations influence overall system effectiveness. These plots show that
a configuration around LENGTH_MIN_LENGTH = 2 and LENGTH_MAX_LENGTH = 17 yields the best
balance, filtering out irrelevant or noisy tokens while retaining informative ones.</p>
      </sec>
      <sec id="sec-4-7">
        <title>4.7. Score Threshold</title>
        <p>In information retrieval systems, the SCORE_THRESHOLD parameter defines the minimum score a
document must achieve to be considered relevant and included in the final result set. This threshold
acts as a filter to exclude documents with low relevance scores, thereby improving the precision of the
retrieval system. A threshold set too low may include irrelevant documents, reducing precision, while a
threshold set too high may exclude relevant documents, reducing recall.</p>
        <p>Optimization with Optuna Using Optuna for hyperparameter optimization, we explored various
values for the SCORE_THRESHOLD parameter. The optimization process aimed to find the threshold that
maximizes nDCG.</p>
        <p>As shown in Figure 14, Optuna identified that setting the SCORE_THRESHOLD to approximately 51%
of the highest document score yielded optimal performance. This means that only documents scoring
at least half as high as the top-scoring document are considered relevant [23].</p>
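<p>The filtering rule can be sketched as follows, with the tuned 0.51 fraction as a parameter:</p>

```java
import java.util.*;

/** Sketch of the SCORE_THRESHOLD filter: only documents scoring at least a
 *  fixed fraction of the top score (the tuned value was about 0.51) are kept. */
public class ScoreThreshold {

    static List<Map.Entry<String, Double>> filter(
            Map<String, Double> scores, double threshold) {
        double max = scores.values().stream()
                .mapToDouble(Double::doubleValue).max().orElse(0);
        return scores.entrySet().stream()
                .filter(e -> e.getValue() >= threshold * max) // relative cut-off
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .toList();
    }

    public static void main(String[] args) {
        Map<String, Double> scores = Map.of("d1", 10.0, "d2", 5.2, "d3", 4.9);
        // With threshold 0.51, only documents scoring >= 5.1 survive.
        System.out.println(filter(scores, 0.51)); // d1 and d2 survive, d3 is cut
    }
}
```

<p>Because the cut-off is relative to the top score of each query, the filter adapts to queries whose absolute BM25 scores differ widely.</p>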
      </sec>
      <sec id="sec-4-8">
        <title>4.8. Spellchecking</title>
        <p>Our spellchecking mechanism works by first searching the token in the inverted index. If the token is
not found, the system attempts to find a similar term within the inverted index and substitutes it in the
query.</p>
        <p>Two main parameters can be configured:
• SPELLCHECKER_NUMBER_OF_SUGGESTIONS → the number of spell correction suggestions to
consider.
• SPELLCHECKER_MIN_TERM_LENGTH → the minimum length of a word to be eligible for
spellchecking.</p>
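<p>A minimal illustration of this substitution logic follows; it uses plain Levenshtein distance to pick the replacement, whereas the actual system looks up similar terms directly in the inverted index, so treat this as a sketch of the idea rather than our implementation:</p>

```java
import java.util.*;

/** Sketch of the spellchecking step: if a query token is absent from the
 *  index vocabulary, substitute the closest indexed term. Tokens shorter
 *  than the minimum term length are left untouched. */
public class SpellcheckDemo {

    static final int MIN_TERM_LENGTH = 4; // cf. SPELLCHECKER_MIN_TERM_LENGTH

    /** Classic Levenshtein edit distance. */
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    static String correct(String token, Set<String> vocabulary) {
        if (token.length() < MIN_TERM_LENGTH || vocabulary.contains(token)) return token;
        return vocabulary.stream()
                .min(Comparator.comparingInt(t -> editDistance(token, t)))
                .orElse(token);
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("gouvernement", "climat", "économie");
        System.out.println(correct("climta", vocab)); // prints climat
    }
}
```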
        <p>Using Optuna for hyperparameter tuning, we observed that the optimal value for SPELLCHECKER
_MIN_TERM_LENGTH is around 4 or 5, while the best value for SPELLCHECKER_NUMBER_OF
_SUGGESTIONS is typically 1 or 2. These trends are clearly visible in Figure 15.</p>
      </sec>
      <sec id="sec-4-9">
        <title>4.9. Proximity Reranking</title>
        <p>The reranking mechanism works by taking the documents retrieved by the initial search, analyzing the
query, and extracting all word pairs from it. These pairs are then searched within the retrieved
documents, considering only occurrences where the two words appear within a configurable maximum
distance.</p>
        <p>The final document score is computed by combining the BM25 score with the proximity score using
configurable weights. As seen in the previous section, the following parameters can be tuned:
• PROXIMITY_RERANKER_MAX_SLOP → the maximum allowed distance between two words for
proximity reranking.
• BM25_WEIGHT → the weight assigned to the BM25 score in the final ranking.</p>
        <p>• PROXIMITY_WEIGHT → the weight assigned to the proximity score in the final ranking.</p>
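<p>A simplified version of this scoring can be sketched as follows; the pair-matching details of the actual implementation may differ:</p>

```java
import java.util.*;

/** Sketch of proximity reranking: count query word pairs that co-occur within
 *  MAX_SLOP positions in the document, then blend that proximity signal with
 *  the BM25 score using the tuned weights. */
public class ProximityRerank {

    static final int MAX_SLOP = 4;
    static final double BM25_WEIGHT = 0.6;
    static final double PROXIMITY_WEIGHT = 0.4;

    /** Fraction of query word pairs found within MAX_SLOP in the document. */
    static double proximityScore(List<String> query, List<String> doc) {
        int pairs = 0, matched = 0;
        for (int i = 0; i < query.size(); i++) {
            for (int j = i + 1; j < query.size(); j++) {
                pairs++;
                if (withinSlop(query.get(i), query.get(j), doc)) matched++;
            }
        }
        return pairs == 0 ? 0 : (double) matched / pairs;
    }

    static boolean withinSlop(String a, String b, List<String> doc) {
        for (int p = 0; p < doc.size(); p++)
            for (int q = 0; q < doc.size(); q++)
                if (doc.get(p).equals(a) && doc.get(q).equals(b)
                        && Math.abs(p - q) <= MAX_SLOP) return true;
        return false;
    }

    static double finalScore(double bm25, double proximity) {
        return BM25_WEIGHT * bm25 + PROXIMITY_WEIGHT * proximity;
    }

    public static void main(String[] args) {
        List<String> query = List.of("crise", "financière");
        List<String> doc = List.of("la", "crise", "financière", "mondiale");
        System.out.println(finalScore(1.0, proximityScore(query, doc)));
    }
}
```

<p>The constants reflect the tuned values reported in this section.</p>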
        <p>In Figure 16 we can see that Optuna identified the best value for PROXIMITY_RERANKER_MAX_SLOP
as 4, which represents a good trade-off between flexibility and relevance in matching word pairs.
Additionally, the optimal weights for combining the scores were approximately 0.6 for BM25 and 0.4
for proximity reranking.</p>
      </sec>
      <sec id="sec-4-10">
        <title>4.10. Additional Information</title>
        <p>To view the details of our TREC Eval runs, you can visit the following link:
https://bitbucket.org/upd-dei-stud-prj/seupd2425-basette/src/master/eval/. We have uploaded all of our TREC Eval runs across
various parameters, providing a comprehensive overview of the results.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In the first part of this section, we provide an overview of the retrieval systems we developed, along
with their performance on the test set. This allows us to assess and compare their effectiveness under
standardized conditions. The evaluation is aimed at understanding how well these systems generalize
to unseen data and, consequently, how well they perform in real-world scenarios.</p>
      <p>The assessment was conducted on the snapshots in table 5, which are part of the test set provided by
the LongEval benchmark.
Below is a brief description of the four retrieval systems we analyzed, along with their performance
based on the TREC evaluation metrics. Each system represents a different stage of development or
enhancement, allowing us to assess the impact of various techniques on retrieval quality.
Default This system uses Lucene’s default configuration, without any fine-tuning. It serves as our
baseline, helping us understand how much improvement has been achieved through our customizations.
The system relies on the standard FrenchAnalyzer, includes basic stopword removal, and uses Lucene’s
default BM25 parameters for the search component. As shown in Table 6, the NDCG scores for this
system are consistently lower than the others. This is expected, given that no optimization has been
applied, making it a useful reference point for evaluating the effectiveness of our modifications.
Fine-Tuned This version incorporates the parameters obtained through our tuning process using
Optuna, as detailed in the training section 4. Unlike the default system, it benefits from targeted
adjustments to BM25 and other indexing/search parameters. The results show a clear improvement in
NDCG compared to the baseline, confirming that our optimization process has had a positive impact on
retrieval performance.</p>
      <p>Spell Checking In this system, we added a spell-checking component on top of the fine-tuned
configuration. The idea was to test whether correcting potential typos in user queries could lead to
better results. Although we were initially unsure of the effect this would have, the system ended up
performing slightly better than the fine-tuned version on average across the snapshots. This suggests
that spell checking contributes positively in this context, at least with our chosen setup and data.
Re-Ranking The final system extends the spell-checking setup by adding a positional re-ranking
step. After retrieving the top-ranked documents, this additional phase re-orders them based on
positional scoring. Among all the systems, this one achieved the best overall performance. However, this
improvement comes with a cost: query execution time is roughly twice as long compared to the other
systems. Despite the slower response, the gain in relevance indicates that re-ranking is a worthwhile
step.</p>
      <sec id="sec-5-1">
        <title>5.2. Statistical Analysis</title>
        <p>To understand whether the differences in NDCG scores across the various retrieval systems are
statistically meaningful, we carried out a set of analyses of variance (ANOVA).</p>
        <sec id="sec-5-1-1">
          <title>5.2.1. One-way ANOVA: System Effect</title>
          <p>We began with a one-way ANOVA, a statistical test used to determine whether there are significant
differences between the means of three or more independent groups, in our case, the four retrieval
systems. This type of analysis helps assess whether the choice of system has a real impact on the
retrieval performance (measured by NDCG), or whether the observed differences could be due to random
chance.</p>
          <p>Table 7 reports the ANOVA results. The p-value associated with the system factor is extremely
small (p &lt; 0.001), indicating that the type of system used has a statistically significant effect on the
NDCG scores. In other words, not all systems perform the same, and the differences we observe are
unlikely to be due to noise.</p>
          <p>Figure 17 shows the distribution of NDCG scores for each system. It’s immediately apparent that
the three systems we developed, fine_tuned, spellchecker, and reranking, achieve higher and
more consistent scores compared to the default system. Among our methods, the reranking system
performs the best on average, though all three are relatively close to one another.</p>
          <p>To explore these differences in more detail, we conducted a Tukey’s HSD (Honestly Significant
Difference) test. This post-hoc analysis is used after ANOVA to perform pairwise comparisons between
all groups and determine exactly which ones differ from each other significantly.</p>
          <p>The result is shown in Figure 18, which presents the confidence intervals for all system comparisons.
This test confirms what the boxplot already suggests: the default system is statistically distinct from
the other three, which form a relatively homogeneous group. In fact, Tukey’s test effectively identifies
two statistically significant "clusters": one composed of the default system, and another grouping
together fine_tuned, spellchecker, and reranking. This separation reinforces the idea that the
optimizations we introduced led to real, consistent improvements in retrieval performance compared to
the baseline.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.2.2. Two-way ANOVA: System and Snapshot Interaction</title>
          <p>To better understand how both the retrieval system and the specific temporal snapshot influence
performance, and whether there is any interaction between the two, we performed a two-way ANOVA.
This type of analysis allows us to assess the independent effects of each factor (in this case, system and
snapshot), as well as how their combination might affect the outcome (NDCG scores).</p>
          <p>The summary of the analysis is shown in the table below:</p>
          <p>The results confirm that both factors, system and snapshot, have a significant effect on NDCG
scores (p &lt; 0.001 in both cases). This means that not only does the choice of retrieval system influence the
performance, but so does the specific snapshot of data used for evaluation.</p>
          <p>The boxplot in Figure 19 provides a visual summary of these effects. Similar to the one-way case, we
observe that the three optimized systems (fine_tuned, spellchecker, and reranking) perform
noticeably better than the default system. However, a key detail emerges here: performance on the
2023_08 snapshot is consistently lower across all systems. The other snapshots show relatively stable
and balanced results, but 2023_08 appears to be an outlier in terms of performance drop.</p>
          <p>To dive deeper into these interactions, we ran a post-hoc Tukey HSD test, which identifies specific
pairs of group combinations (system × snapshot) that differ significantly from one another. The results,
shown in Figure 20, reveal several statistically distinct groupings.</p>
          <p>Notably, the Tukey test separates the combinations into four broad groups:
• The worst-performing group: default on 2023_08.
• A second lower tier: the other systems also on 2023_08.</p>
          <p>• A third group: default across all other snapshots.</p>
          <p>• The top group: all other systems (fine_tuned, spellchecker, reranking) across snapshots
2023_03 to 2023_07.</p>
          <p>This suggests that 2023_08 has a negative effect on system performance in general, regardless of
which system is used. It is likely that this drop is not entirely due to the systems themselves, but rather
to some property of the 2023_08 snapshot, possibly related to data quality, domain shift, or changes in
content distribution during that period.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>In this project, we developed a configurable and multithreaded Information Retrieval system tailored
for efficient operation on basic and commonly available hardware, in line with the CLEF LongEval
lab’s goals. Our guiding principle was to prioritize performance and adaptability without relying on
specialized computational resources such as GPUs. To this end, we built a Java-based architecture
powered by Lucene, with extensive support for configuration-driven customization of preprocessing,
indexing, and retrieval strategies.</p>
      <p>We conducted a systematic exploration of classical IR techniques, including text normalization, token
filtering, stemming, and stopword removal, as well as the use of n-gram matching and proximity-based
reranking. Our system achieved significant performance gains through multithreaded indexing and
querying, with indexing time reduced by nearly an order of magnitude. Additionally, we employed
Optuna for automatic hyperparameter optimization, allowing us to fine-tune our system for both
retrieval effectiveness and stability over time.</p>
      <p>Several advanced techniques, such as semantic expansion via WordNet, temporal alignment using
Duckling, and neural reranking with CamemBERT, were considered but ultimately discarded due to
either limited performance improvements or prohibitive computational costs. These decisions were
consistent with our core objective: building a fast, tunable IR system that works reliably on modest
hardware.</p>
      <p>Looking forward, we plan to explore lightweight semantic models, expand our support for multilingual
corpora, and revisit some of the discarded approaches under improved hardware conditions. Our
long-term vision remains consistent with the title of this project: designing an efficient and adaptable IR
system for all hardware profiles, capable of balancing retrieval effectiveness with practical performance
constraints.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>
        The authors thank the organisers of CLEF 2025 LongEval for providing the data and evaluation
infrastructure [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors employed ChatGPT (OpenAI) exclusively to (1) refine English grammar and style and
(2) rephrase or clarify selected sentences for improved conceptual clarity. All AI-suggested text was
thoroughly reviewed, edited, or discarded at the authors’ discretion, and no AI tool was used to generate
original scientific content, devise research ideas, or draw conclusions. The authors take full responsibility
for the accuracy and integrity of the manuscript. This usage is fully compliant with the CEUR-WS
Policy on AI-Assisting Tools.
</p>
      <p>[9] M. Didion, extjwnl: Extended Java WordNet Library, https://github.com/extjwnl/extjwnl, 2014. Accessed: 2025-05-22.</p>
      <p>[10] B. Sagot, The Lefff, a freely available and large-coverage morphological and syntactic lexicon for French, in: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), European Language Resources Association (ELRA), Valletta, Malta, 2010. URL: http://alpage.inria.fr/~sagot/wolf-en.html, also includes WOLF: WordNet Libre du Français.</p>
      <p>[11] O. Lutz, Synonymes (GitHub repository), https://github.com/olup/synonymes, 2021. Accessed: 2025-05-22.</p>
      <p>[12] Facebook AI Research, Duckling: A Haskell library for parsing temporal expressions, https://github.com/facebook/duckling, 2016.</p>
      <p>[13] Facebook, Duckling Docker image, https://hub.docker.com/r/facebook/duckling, 2025.</p>
      <p>[14] L. Martin, B. Muller, P. J. O. Suárez, Y. Dupont, L. Romary, É. V. de la Clergerie, D. Seddah, B. Sagot, CamemBERT: a tasty French language model, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2020, pp. 7203-7219. URL: https://aclanthology.org/2020.acl-main.645.</p>
      <p>[15] Amazon AI, Deep Java Library (DJL), https://djl.ai, 2025. Accessed: 2025-05-21.</p>
      <p>[16] Hugging Face, camembert-base, https://huggingface.co/camembert-base, 2020. Accessed: 2025-05-21.</p>
      <p>[17] FasterXML, Jackson (GitHub repository), 2025. URL: https://github.com/FasterXML/jackson.</p>
      <p>[18] CLEF LongEval Organizers, CLEF LongEval Data, 2025. URL: https://clef-longeval.github.io/.</p>
      <p>[19] Stopwords ISO Project, French stopword lists, 2025. URL: https://github.com/stopwords-iso/stopwords-fr/tree/master/raw. Accessed: 2025-05-25.</p>
      <p>[20] Apache Lucene, FrenchAnalyzer (Lucene 5.0.0 API), 2015. URL: https://lucene.apache.org/core/5_0_0/analyzers-common/org/apache/lucene/analysis/fr/FrenchAnalyzer.html. Accessed: 2025-05-25.</p>
      <p>[21] D. Turnbull, Practical BM25 - Part 3: Considerations for picking b and k1 in Elasticsearch, 2019. URL: https://www.elastic.co/blog/practical-bm25-part-3-considerations-for-picking-b-and-k1-in-elasticsearch. Accessed: 2025-05-25.</p>
      <p>[22] Elastic, Length token filter, 2025. URL: https://www.elastic.co/docs/reference/text-analysis/analysis-length-tokenfilter. Accessed: 2025-05-25.</p>
      <p>[23] M. Lee, Better RAG retrieval - similarity with threshold, 2023. URL: https://meisinlee.medium.com/better-rag-retrieval-similarity-with-threshold-a6dbb535ef9e. Accessed: 2025-05-25.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cancellieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Ebshihy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iommi</surname>
          </string-name>
          , J. Keller, P. Knoth,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pride</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schaer</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2025 LongEval Lab on Longitudinal Evaluation of Model Performance</article-title>
          , in: J.
          <string-name>
            <surname>Carrillo-de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Apache</surname>
          </string-name>
          ,
          <year>2023</year>
          ,
          <source>Lucene v9.5.0</source>
          . URL: https://lucene.apache.org/core/9_5_0/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>N-gram language models</article-title>
          , 3rd ed., draft ed., Stanford University,
          <year>2025</year>
          . URL: https://web.stanford.edu/~jurafsky/slp3/3.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hariri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kurgan</surname>
          </string-name>
          ,
          <source>Parallel Computing and Information Retrieval</source>
          , Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          ,
          <article-title>The probabilistic relevance framework: Bm25 and beyond</article-title>
          ,
          <source>Foundations and Trends in Information Retrieval</source>
          <volume>3</volume>
          (
          <year>2009</year>
          )
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          . doi:10.1561/1500000019.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Apache Software Foundation</surname>
          </string-name>
          , Apache Lucene - similarity models,
          <year>2025</year>
          . URL: https://lucene.apache.org/core/9_9_2/core/org/apache/lucene/search/similarities/BM25Similarity.html.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Akiba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yanase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koyama</surname>
          </string-name>
          ,
          <article-title>Optuna: A next-generation hyperparameter optimization framework</article-title>
          , in:
          <source>Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          , ACM,
          <year>2019</year>
          , pp.
          <fpage>2623</fpage>
          -
          <lpage>2631</lpage>
          . URL: https://optuna.org/. doi:10.1145/3292500.3330701.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Wordnet: A lexical database for english</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>38</volume>
          (
          <year>1995</year>
          )
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>