<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>DS@GT at LongEval: Evaluating Temporal Performance in Web Search Systems and Topics with Two-Stage Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anthony Miyaguchi</string-name>
          <email>acmiyaguchi@gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Imran Afrulbasha</string-name>
          <email>iafrulbasha3@gatech.edu</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aleksandar Pramov</string-name>
          <email>apramov3@gatech.edu</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>North Ave NW, Atlanta, GA 30332</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <volume>1</volume>
      <issue>20</issue>
      <abstract>
        <p>Information Retrieval (IR) models are often trained on static datasets, making them vulnerable to performance degradation as web content evolves. The DS@GT competition team participated in the Longitudinal Evaluation of Model Performance (LongEval) lab at CLEF 2025, which evaluates IR systems across temporally distributed web snapshots. Our analysis of the Qwant web dataset includes exploratory data analysis with topic modeling over time. The two-phase retrieval system employs sparse keyword searches, utilizing query expansion and document reranking. Our best system achieves an average NDCG@10 of 0.296 across the entire training and test dataset, with an overall best score of 0.395 on 2023-05. The accompanying source code for this paper is at https://github.com/dsgt-arc/longeval-2025.</p>
      </abstract>
      <kwd-group>
        <kwd>Cranfield Paradigm</kwd>
        <kwd>Text Mining</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Query Expansion</kwd>
        <kwd>Temporal Drift</kwd>
        <kwd>Re-ranking</kwd>
        <kwd>Qwant</kwd>
        <kwd>Topic Modeling</kwd>
        <kwd>Normalized Discounted Cumulative Gain (nDCG)</kwd>
        <kwd>Latent Dirichlet allocation (LDA)</kwd>
        <kwd>Non-Negative Matrix Factorization (NMF)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>1. BM25 Baseline: standard BM25 retrieval without any query expansion or reranking.</p>
      <p>2. BM25 + Expansion: BM25 retrieval over LLM-expanded queries.</p>
      <p>3. BM25 + Reranking: BM25 retrieval followed by cross-encoder reranking.</p>
      <p>4. BM25 + Expansion + Reranking: LLM-expanded queries with cross-encoder reranking.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        LongEval’s motivation is rooted in the observed decline in IR-model performance over time [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. An
extensive survey documented this effect and concluded that increasing temporal distance degrades
relevance, calling for retrieval models that incorporate temporal features [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Previous work connected
retrieval accuracy to calendar cycles, showing that systems ignoring weekly and yearly periodicities
systematically under-rank timely documents—such as sports fixtures or fiscal reports [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Previous LongEval research quantified the degradation and proposed mitigations, including frequent
model updates and query-time reranking [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The present work extends that line by first characterizing
the upstream data through topic modeling—a method explored in web-search retrieval [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]—and then
presenting a more resilient architecture for large-scale experimentation.
      </p>
      <p>
        The approach adopted in our work follows the standard two-stage pipeline in which a sparse BM25
retriever is followed by a neural re-ranker [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Because automatic query expansion has repeatedly
improved retrieval effectiveness [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the system queries Google’s Gemini LLM [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to generate expansions
before retrieval.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Qwant Search Engine Dataset</title>
      <sec id="sec-3-1">
        <title>3.1. Shared Task Data Collection</title>
        <p>
          We briefly touch upon the acquisition of pages and queries from the commercial search engine Qwant
as part of the shared task setup. The approach largely follows the Cranfield Paradigm, where data
acquisition is conducted periodically, forming a sequence of sub-collections [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The data collection
process involves constructing topics, queries, relevance estimates, and documents.
        </p>
        <p>
          Topics are selected once and shared across all sub-collections, guided by several criteria [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. First,
topics needed to be popular enough to generate a substantial number of relevant queries. Second, they
had to be stable over time to support longitudinal performance evaluation of the information retrieval
(IR) system. The persistence of these topics across different periods is assessed with tools like Google
Trends [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Lastly, the topics were required to be general enough to encompass a wide variety of
queries. The final set of topics chosen by the dataset organizers is shown in Table 1.
        </p>
        <p>
          The query selection process begins with extracting user topics and mapping them to real queries
answered by Qwant’s search engine, ensuring all displayed results are indexed. Queries are matched
to topics using substring filtering, forming sets of relevant queries per topic. Since this process can
generate tens of thousands of queries, a top-k selection retains only the most frequently asked queries
per topic, reducing redundancy [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>Q_T = ⋃_{t ∈ T} Q_t, such that Q_t = { q | q ∈ Q, t ⊆_str q } (1)</p>
        <p>
          If Q is the set of all Qwant queries and T is the set of topics defined in Table 1, for each topic t ∈ T
the lab organizers select all the queries q from Q that contain t as a sub-string [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
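        <p>The substring selection and top-k retention described above can be sketched in a few lines of Python (an illustrative reimplementation over toy data, not the organizers' actual tooling):</p>

```python
from collections import Counter

def select_queries(topics, query_log, k=3):
    """For each topic t, keep queries q with t as a substring (Eq. 1),
    then retain only the k most frequently asked queries per topic."""
    counts = Counter(query_log)
    selected = {}
    for t in topics:
        matching = [q for q in counts if t in q]
        # rank by frequency; Python's sort is stable, so ties keep log order
        matching.sort(key=lambda q: counts[q], reverse=True)
        selected[t] = matching[:k]
    return selected

log = ["velo electrique", "velo electrique", "prix velo", "meteo paris", "velo occasion"]
print(select_queries(["velo"], log, k=2))
# {'velo': ['velo electrique', 'prix velo']}
```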
        <p>
          Next, queries undergo automatic filtering, selecting only those with at least 10 relevance assessments,
followed by a manual review to merge similar queries and remove adult content [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. This structured
filtering refines the dataset, ensuring more reliable query distributions.
        </p>
        <p>
          The relevance estimates for LongEval-Retrieval rely on implicit feedback from user clicks, as Qwant
preserves privacy by not tracking multiple clicks, dwell times, or query reformulations [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Since
raw click data is noisy and biased toward top-ranked results, Click Models are used to infer document
relevance while minimizing bias. Given Qwant’s privacy constraints, a Cascade Model—a simplified
version of Dynamic Bayesian Networks (DBN)—is employed, where users scan results from top to
bottom and click only on attractive documents [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The attractiveness parameter a_{q,d} is estimated
through Maximum Likelihood Estimation (MLE), providing a probabilistic relevance measure for
query-document pairs. To make these relevance estimates compatible with traditional IR metrics, a_{q,d} is mapped
to discrete relevance values (0 = not relevant, 1 = relevant, 2 = highly relevant). The MLE takes the
following form for a query q and document d, where S_{q,d} denotes the set of all instances in Qwant’s
query log in which document d was displayed in a search engine results page (SERP) at or above the
rank of a clicked document, and the binary variable c_d(s) indicates whether document d was clicked
within a given entry s.
        </p>
        <p>â_{q,d} = (1 / |S_{q,d}|) ∑_{s ∈ S_{q,d}} c_d(s) (2)</p>
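        <p>A minimal sketch of this estimate in Python, using fabricated click indicators; the discretization thresholds below are illustrative assumptions, since the exact mapping to grades is not specified here:</p>

```python
def attractiveness(clicks):
    """MLE of cascade-model attractiveness: the fraction of qualifying
    SERP impressions S_{q,d} in which document d was clicked."""
    return sum(clicks) / len(clicks)

def to_grade(a, lo=1/3, hi=2/3):
    """Map attractiveness to a discrete relevance grade (thresholds illustrative)."""
    return 0 if a < lo else (1 if a < hi else 2)

clicks = [1, 0, 1, 1]           # c_d(s) over four impressions of d
a_hat = attractiveness(clicks)  # 0.75
print(a_hat, to_grade(a_hat))   # 0.75 2
```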
        <p>The document corpus for LongEval-Retrieval is extracted from Qwant’s search index, including both
the documents displayed in search results and a random sample of non-relevant documents to minimize
bias. To avoid skewing the corpus toward Qwant’s ranking function, up to 100,000 documents per topic
are randomly selected based on matching word tokens. Before inclusion, documents undergo a cleaning
process, in which their text is extracted using Qwant’s internal tools and filtered for adult and spam
content.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Train Dataset Statistics</title>
        <p>We provide statistics over the training dataset to aid decisions related to cost-effectiveness and evaluation
time. We use the tiktoken library to tokenize documents alongside a naive whitespace tokenizer. In
Table 2, we see that there are about two million documents per time-step, with an average of 850 words
or 1,300 tokens per document. We provide per-million (PM) count statistics, which can be used in
throughput-to-cost or time calculations, in Tables 3 and 4.</p>
        <p>This data is useful for extrapolating dense-retrieval workflows. Full-scale embedding using self-hosted
encoder-only sentence transformers requires almost an entire day of GPU hours. With the
nomic-ai/nomic-embed-text-v2-moe encoder on the train dataset for 2022-08, we run at 100%
utilization against a V100 GPU in 50 parts at 25 minutes per part, for a total of 21 hours of GPU time.</p>
        <p>We can use these statistics to estimate the cost of processing all the data with
a large language model (LLM). For example, as of May 2025, the Google Gemini 2.0 model costs 10
cents per million tokens of input and 40 cents per million tokens of output. The projected cost of
reading the entire training dataset as input context would be $2,528 USD, which is cost-prohibitive for
experimentation.</p>
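        <p>The arithmetic behind this projection follows directly from the per-split statistics; the document count and tokens-per-document figures below are the ones quoted above, and the full dataset spans multiple time-steps:</p>

```python
def llm_input_cost_usd(n_docs, tokens_per_doc, usd_per_million_input_tokens):
    """Cost of feeding every document once as LLM input context."""
    total_tokens = n_docs * tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_input_tokens

# ~2M documents per time-step at ~1,300 tokens each, $0.10 per million input tokens
per_split = llm_input_cost_usd(2_000_000, 1_300, 0.10)
print(f"${per_split:,.0f} per time-step")  # $260 per time-step
```

        <p>At roughly $260 per time-step, summing over all training splits gives a total on the order of the $2,528 quoted above.</p>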
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Topic Modeling for Exploratory Data Analysis</title>
      <p>Topic modeling is a machine learning technique used to uncover latent themes or patterns from
large datasets, providing insight into the evolving structure of textual information. In the context of
LongEval, we leveraged topic modeling to analyze how thematic distributions shift over time, capturing
longitudinal trends in digital content. Recognizing non-trivial topic drifts could provide some insight
into any search engine performance fluctuations.</p>
      <sec id="sec-4-1">
        <title>4.1. Topic Models</title>
        <sec id="sec-4-1-1">
          <title>4.1.1. Non-negative Matrix Factorization</title>
          <p>
            Non-Negative Matrix Factorization (NMF) factorizes a given non-negative matrix X into two
lower-dimensional non-negative matrices, W and H [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. This factorization is particularly useful for topic
modeling, as it provides an interpretable structure where documents are mixtures of topics characterized
by distributions over words. By factorizing the document-term matrix, we can capture the underlying
themes present in long-term information retrieval datasets, helping to evaluate how topic distributions
shift over time.
          </p>
          <p>
            The optimization objective for NMF seeks to minimize the reconstruction error between X and
W × H [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. X ∈ ℝ^(N×V) represents the document-term matrix, with N denoting the total number of
documents in the corpus and V representing the number of unique terms in the vocabulary. The matrix
W ∈ ℝ^(N×K), for K topics, captures the document-topic distribution, where each row indicates the strength of topic
presence in a document. Meanwhile, H ∈ ℝ^(K×V) defines the topic-word associations, with each row
highlighting the key terms that define a given topic.
          </p>
          <p>The NMF objective is as follows:</p>
          <p>min_{W ≥ 0, H ≥ 0} f(W, H) = ‖X − WH‖²_F</p>
          <p>
where ‖·‖_F denotes the Frobenius norm. The constraints W ≥ 0 and H ≥ 0 enforce non-negativity,
ensuring that topics and their respective word distributions remain interpretable [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. The Frobenius
norm ensures smooth optimization, enabling faster convergence with gradient-based methods. Unlike
the generalized Kullback-Leibler (KL) divergence, it does not require a probabilistic interpretation of
the input matrix, making it more suitable for general-purpose topic modeling.
          </p>
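          <p>This objective can be minimized with the classic multiplicative update rules, sketched below in NumPy; this is an illustrative solver on a toy matrix, not the scikit-learn implementation used in our experiments:</p>

```python
import numpy as np

def nmf(X, k, iters=200, eps=1e-9, seed=0):
    """Minimize ||X - WH||_F^2 with Lee-Seung multiplicative updates;
    the update rules keep W and H non-negative by construction."""
    rng = np.random.default_rng(seed)
    n, v = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, v)) + eps
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)  # topic-word update
        W *= (X @ H.T) / (W @ H @ H.T + eps)  # document-topic update
    return W, H

X = np.random.default_rng(1).random((20, 30))  # toy document-term matrix
W, H = nmf(X, k=5)
error = np.linalg.norm(X - W @ H, "fro")  # reconstruction error after fitting
```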
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Latent Dirichlet Allocation</title>
          <p>
            LDA is a probabilistic framework for unsupervised topic discovery [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. It assumes that each document
is composed of multiple latent topics, with each topic represented by a distribution over words. The
following joint probability distribution governs the complete generative process in LDA.
          </p>
          <p>p(β_{1:K}, θ_{1:D}, z_{1:D}, w_{1:D}) = ∏_{k=1}^{K} p(β_k) ∏_{d=1}^{D} p(θ_d) ( ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | β_{1:K}, z_{d,n}) )
where:
• β_{1:K} are the topic distributions over the vocabulary.
• θ_{1:D} are the document-specific topic proportions.
• z_{d,n} is the topic assignment for the n-th word in document d, drawn from Multinomial(θ_d).
• w_{d,n} is the observed word, drawn from the word distribution β_{z_{d,n}} of its assigned topic.</p>
          <p>This joint distribution reflects the conditional dependencies among the variables:
• The topic assignment z_{d,n} depends on the document-specific topic proportions θ_d.
• The observed word w_{d,n} depends on the topic assignment z_{d,n} and all the topics β_{1:K}.</p>
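          <p>The generative process can be made concrete with a short NumPy simulation that draws a toy corpus from the model; the dimensions below are chosen arbitrarily for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, D, N = 3, 8, 4, 10  # topics, vocabulary size, documents, words per doc

beta = rng.dirichlet(np.ones(V), size=K)   # topic-word distributions beta_{1:K}
theta = rng.dirichlet(np.ones(K), size=D)  # document topic proportions theta_{1:D}

docs = []
for d in range(D):
    z = rng.choice(K, size=N, p=theta[d])          # topic assignments z_{d,n}
    w = [rng.choice(V, p=beta[z_n]) for z_n in z]  # words w_{d,n} ~ beta_{z_{d,n}}
    docs.append(w)
```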
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Exploratory Data Analysis</title>
        <sec id="sec-4-2-1">
          <title>4.2.1. Methodology</title>
          <p>
            Our modeling pipelines for NMF and LDA both leverage Luigi[
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] for orchestration, PySpark
(SparkSQL/MLlib) for preprocessing, and scikit-learn for end-to-end model execution [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. We randomly
sample documents from the entire Qwant collection to ensure a diverse subset for topic modeling
while reducing computational load. The text preprocessing stage involves tokenization followed by
conversion into a document-term matrix using a term frequency representation. Next, both NMF and
LDA topic models are trained with 20 topics to extract topic distributions across documents. An LLM,
specifically Grok 3 (a 2.7-trillion-parameter model), summarizes the top 100 words for each topic.
          </p>
          <p>To visualize NMF results, the topic distributions of a random subset of documents within an arbitrary
month were inferred using a pre-trained NMF model, yielding document-topic association scores. For
both NMF and LDA, we projected the high-dimensional topic embeddings into two dimensions to
facilitate clustering analysis using principal component analysis (PCA) and Gaussian random projections
(GRP). We create a scatter plot of the projections, with colors indicating the dominant topic assignments
per document. These visualizations offer insight into the topic structure uncovered by both modeling
methods while preserving essential document relationships in a reduced space.</p>
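          <p>Both projections reduce a document-topic matrix to two coordinates per document, and can be sketched in NumPy as a minimal stand-in for the scikit-learn PCA and GaussianRandomProjection used in the pipeline:</p>

```python
import numpy as np

def pca_2d(T):
    """Centered PCA via SVD: coordinates along the top two components."""
    Tc = T - T.mean(axis=0)
    U, S, Vt = np.linalg.svd(Tc, full_matrices=False)
    return Tc @ Vt[:2].T

def grp_2d(T, seed=0):
    """Gaussian random projection to two dimensions."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(T.shape[1], 2)) / np.sqrt(2)
    return T @ R

T = np.random.default_rng(0).random((100, 20))  # 100 docs x 20 topic weights
xy_pca, xy_grp = pca_2d(T), grp_2d(T)
print(xy_pca.shape, xy_grp.shape)  # (100, 2) (100, 2)
```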
          <p>[Figure 1: Document-topic distributions projected into two dimensions for the 2022_06 and 2022_12 snapshots, with LLM-generated topic labels such as French Public Administration, Online Pharmacies, French E-commerce, French Elections, Travel Accommodations, and Website Analytics. Panels: (a) NMF PCA topic distributions; (b) NMF GRP topic distributions; (c) LDA PCA topic distributions; (d) LDA GRP topic distributions.]</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Discussion</title>
          <p>Figures 1a and 1b show topic groupings for equally spread out subsequences in the overall collection
spanning June 2022 to February 2023. These topics are coherent and cover subjects one might expect to
search for, including public services, e-commerce, social media, jobs, real estate, and travel. The topics
also capture temporal saliency through current events, such as fluctuations in fuel prices and upcoming
elections.</p>
          <p>In terms of French text preprocessing, our LDA topic summary documents display a substantial
amount of filler text, as shown in Table 5. A subsequent attempt at document cleanup might involve using
NLTK’s WordNetLemmatizer, spaCy, Gensim, or similar libraries. In general, we found that, qualitatively,
LDA produced less meaningful results than NMF, as our topic description file primarily contained filler
words, stopwords, and numbers, making it challenging to summarize coherent topics. Attempts to
refine LDA preprocessing through stopword removal and lemmatization did not sufficiently filter out
non-informative words, resulting in reduced topic interpretability compared to the more structured
outputs from NMF; with these changes, our visual clustering results did not change significantly.</p>
          <p>We show our final visualization of temporal topic groupings in Figure 2. NMF tends to produce
sparse topic representations, meaning that each topic often consists of a smaller set of highly weighted
words. LDA’s probabilistic nature can sometimes result in topics where many words have a non-zero
probability, potentially making them less distinct. There are some interesting patterns in these visual
representations, such as the upper orange topic group in our NMF PCA graphs, which appears around
October 2022 in Figure 2 and aligns temporally with the sudden NDCG spike measured by our evaluation
system.</p>
          <p>We also attempted to plot the data using neighborhood manifold techniques, such as t-SNE and
UMAP. However, both resulted in out-of-memory issues despite multiple runs to tune n_components
and n_neighbors, likely due to a combination of sample size and probability vectors being too large
to fit into memory for the optimization routines.</p>
          <p>As a note for future reruns of this experiment, quantitative topic-coherence metrics
provide objective, standardized measures of topic quality that eliminate the subjectivity inherent in
visual clustering assessments, enabling reproducible model comparison and systematic hyperparameter
optimization. We mention this because temporal grouping similarities are visually less dynamic in
LDA graphs compared to NMF graphs. This separation difference is likely due to LDA’s inability to
thoroughly clean stop words and special characters, resulting in more similar 100-word summaries for
each topic. Interestingly, despite the lack of token preprocessing in NMF, the resulting plots demonstrate
more dynamic shifts in grouping over time. NMF may be capturing the underlying structure of our
dataset more effectively than LDA. Additionally, having more time to experiment with larger samples
of the data and varying the number of topics could lead to more distinct grouping outcomes and topic
summaries.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Methodology</title>
      <sec id="sec-5-1">
        <title>5.1. Evaluation System</title>
        <p>Our information retrieval engine for document ranking is a preliminary attempt to handle large-scale
data processing within high-performance computing environments. We use a standard two-stage
retrieval pipeline to find relevant documents for each query. Queries are expanded to include a larger
number of keywords for a BM25 keyword search to get the 100 most relevant documents. These
documents are then reranked using a cross-encoder before being sent off for evaluation.
{{query_text}}

For each query above, generate a query expansion in French that includes additional relevant terms or phrases.
The query expansion should be no longer than 100 words.
The query engine relies on BM25 and vector search techniques in French.
The output should be a JSON array of objects, each containing the original 'qid' and the expanded 'query'.</p>
        <p>(a) Prompt for query expansion.
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "qid": {
        "type": "string",
        "description": "The query identifier."
      },
      "query": {
        "type": "string",
        "description": "The text of the query."
      }
    },
    "required": ["qid", "query"],
    "additionalProperties": False
  }
}</p>
        <p>(b) JSON schema for the structured output.</p>
        <p>
          As illustrated in Figure 3, the system begins by ingesting Qwant documents, transforming them
into Parquet format, and applying optional sentence-transformer embeddings. These preprocessing
steps are executed on Georgia Tech’s PACE supercomputing cluster, orchestrated via SLURM workload
managers [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], with the processed data stored in a shared directory for downstream retrieval. For
the retrieval workflow, we employ Pyserini [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] for keyword searches, with supporting batch
processing pipelines implemented using PySpark and Luigi [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. At query time, the system references a
precomputed mapping index of Qwant queries to their Gemini-expanded variants. We use the prompt
in Figure 4 and ensure that the output adheres to a concrete schema. All hyperparameters such as
temperature are set to the default values as per the Gemini API SDK.
        </p>
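        <p>To make the first stage concrete, a compact pure-Python BM25 scorer is sketched below. This is an illustration of the scoring family used by the Lucene/Pyserini index, not the production code; the k1 and b values mirror Pyserini’s defaults:</p>

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each whitespace-tokenized document against the query with BM25."""
    toks = [d.split() for d in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in toks) / N
    df = Counter(w for t in toks for w in set(t))  # document frequencies
    scores = []
    for t in toks:
        tf, s = Counter(t), 0.0
        for w in query.split():
            if tf[w] == 0:
                continue
            idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

docs = ["le velo electrique en ville", "meteo demain paris", "acheter un velo"]
print(bm25_scores("velo electrique", docs))  # first doc scores highest
```

        <p>In the full pipeline, the top 100 BM25 hits per expanded query are then passed to the cross-encoder for reranking.</p>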
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Evaluation Metrics</title>
        <p>We evaluate the system against the relevant information retrieval measures for the task. The normalized
Discounted Cumulative Gain (nDCG) metric is defined as follows:</p>
        <p>nDCG_p = DCG_p / IDCG_p, where DCG_p = ∑_{i=1}^{p} rel_i / log2(i+1) and IDCG_p = ∑_{i=1}^{|REL_p|} rel_i / log2(i+1) (3)</p>
        <p>IDCG represents the maximum achievable DCG with the same set of relevance scores but in the
perfect ranking order. This equation rewards relevant documents appearing early in the ranked list and
is especially important in web search contexts. We also measure the relative nDCG drop, the formula
for which is as follows under the context of lag data:</p>
        <p>nDCG_drop = (nDCG_lag6 − nDCG_lag8) / nDCG_lag6 (4)</p>
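        <p>Equations (3) and (4) translate directly into a few lines of Python (a minimal reference implementation for graded relevance lists):</p>

```python
import math

def dcg(rels):
    """Discounted cumulative gain over relevance grades in ranked order."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(rels, k=10):
    """nDCG@k: DCG of the system ranking divided by the ideal DCG."""
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0

def relative_drop(ndcg_lag6, ndcg_lag8):
    """Relative nDCG drop between the lag6 and lag8 splits (Eq. 4)."""
    return (ndcg_lag6 - ndcg_lag8) / ndcg_lag6

print(ndcg_at_k([2, 1, 0]))                   # 1.0: ranking already ideal
print(round(relative_drop(0.40, 0.30), 4))    # 0.25: a 25% relative drop
```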
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <sec id="sec-6-1">
        <title>6.1. Retrieval System</title>
        <p>We report the training and test NDCG@10 for our four experiments. The BM25 experiment
yields the top 100 results per query. Query expansion replaces the original query with terms
generated from an LLM. Reranking utilizes a French-specific reranking sentence transformer
(antoinelouis/crossencoder-camembert-base-mmarcoFR) to reweight result sets. We do not
perform fine-tuning on the reranking model.</p>
        <p>In the average over NDCG@10 in Table 6, we find that reranking provides the largest gain in
performance in our pipeline. In an ablation, removing the reranking stage reduces performance
by 0.11, whereas removing the query-expanded results increases the score by 0.01. The performance pattern
holds generally over time: the reranked results score higher than non-reranked results, while the
original queries perform better than expanded queries.</p>
        <p>In the scores over time in Table 7, scores correlate well during the roughly one-year period preceding
the last date in the test set. Before this time, the performance across all models decreases significantly.
We observe that the relative ranking between models follows the average score.</p>
        <p>Experiment (NDCG@10): bm25-reranked (0.296, 0.371); bm25-expanded-reranked (0.295, 0.375); bm25 (0.242, 0.337); bm25-expanded (0.194, 0.314).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <p>The three main conclusions we can draw from the results of the retrieval system are that rerankers
provide proper relevancy signals over keyword result sets, that our particular query expansion method
reduces the performance of searches in this dataset, and that anomalous behavior exists in the system’s
performance prior to October 2022.</p>
      <p>Rerankers have proven effective in practice for IR systems, so it is reassuring to see that the raw
BM25 result sets exhibit a performance boost when reranked. Although we did not perform any further tuning on our
dataset, we observed a consistent performance gap between systems that remains constant over time.</p>
      <p>Despite query expansion underperforming relative to the original queries, the reranking mechanism
can account for those differences. This implies that, on average, both query forms surface a
similar number of relevant documents in the top 100. However, in the expanded queries, important
frequency-adjusted keywords are lost due to factors such as repetition. The reranking semantic space
can overcome the limitations of keyword-based retrieval, thus bringing more relevant documents to the
top. We would need to recompute the MAP scores to strengthen this conjecture; doing so would help
explain the performance gap we are seeing with the query-expanded results.</p>
      <p>We also note that during one of the batches of query expansion prompting, we encountered content-based
restrictions due to the appearance of explicit content search terms. These rows appear in the
last 1/100th of queries sorted by their query IDs in numerical order. We use DeepSeek R1 to generate
the last batch of query expansions. We found that OpenAI models result in a similar set of problems
around content filtering, making it challenging to perform query expansion when using an external
API on a model that is not self-hosted. In the future, we might instead rely on self-hosted models like
Llama or Gemma for query expansion.</p>
      <p>Lastly, we make note of the regime change one year prior to the last element in the test set. While
there are minor fluctuations that warrant further study regarding the performance gap between models,
this shift in performance is challenging to explain. One possibility is that the distribution of tokens
changes significantly in the first few months of the dataset — for example if the document set had a
larger distribution of non-French documents, which could cause issues with the French analyzers used
by Lucene. Another possibility is that many of the queries have temporal saliency to a particular event
in time, i.e., a non-trivial number of queries that reference an event that occurs in September or October
2022.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Future Work</title>
      <p>Future work would involve incorporating dense retrieval into the pipeline. Embedding the entire set of
documents is a computationally expensive endeavor, but it would likely significantly improve retrieval
results. Models like Nomic Embed v2 demonstrate state-of-the-art retrieval performance across multiple
languages and domains and would likely prove viable as a lens into the challenges posed by LongEval.
We provide some analysis of the running time for the entire dataset in Section 3.2, and reiterate that
each date split takes about a day of NVIDIA V100 GPU-time.</p>
      <p>We would also like to dive deeper into the topic modeling exploratory analysis. The generative
keyword-based approach, summarized by a stronger LLM, provides a cost-effective way to organize
documents into non-overlapping clusters and offers deeper insights than geometric methods like
k-means. While we have examined changes in distribution over time, we would like to see how retrieval
scores change over time for popular topics identified at particular snapshots in time.</p>
      <p>Future direction might also include retrieval methods that diverge from the typical solutions that
involve keyword and dense-embedding searches. Given the network structure of the web, it seems
natural to perform retrieval based on implicit link structures between pages. A two-stage retrieval
pipeline like ours can be augmented with node centrality measures, such as PageRank, to help reweight
the relevancy of important documents based on the link structure. Methods like K-core decomposition
can also help prune documents that are likely to be high-frequency noise in the graph. Graph theoretic
analysis techniques also apply to the semantic k-NN graph of a dense embedding model. Although
the link structure is different, the mechanics of the algorithms on the pipeline would be interesting to
explore in the context of the temporal evolution of systems.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Conclusions</title>
      <p>Our experiments reveal the effectiveness of pre-trained reranking methods in enhancing retrieval
performance within keyword-based search systems. We uncover a temporal anomaly in the search
system, where query performance degrades in partitions older than a year. LDA and NMF both offer
insight into data evolution, with NMF yielding more apparent cluster separation for our temporal-aware
IR system. Existing techniques for retrieval and analysis still present many opportunities for
refinement in future LongEval labs, which would be a critical medium for advancing resilient, time-aware
information retrieval.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgements</title>
      <p>Thank you to the DS@GT CLEF team for their support. This research was also supported in part through
cyberinfrastructure resources and services provided by the Partnership for an Advanced Computing
Environment (PACE) at the Georgia Institute of Technology, Atlanta, Georgia, USA.</p>
    </sec>
    <sec id="sec-11">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Gemini Pro and Grammarly to draft the abstract,
assist with formatting, and check grammar and spelling. After using these tools/services, the
author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s
content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alkhalifa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Borkakoty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deveaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Ebshihy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iommi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liakata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Madabushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Medina-Alias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <article-title>Extended overview of the clef 2024 longeval lab on longitudinal evaluation of model performance</article-title>
          ,
          <source>in: CLEF 2024: Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          . URL: https://ceur-ws.org/Vol-3740/paper-213.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jorge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nunes</surname>
          </string-name>
          ,
          <article-title>Survey of temporal information retrieval and related applications</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>47</volume>
          (
          <year>2016</year>
          )
          <fpage>15:1</fpage>
          -
          <lpage>15:41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Keikha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Radinsky</surname>
          </string-name>
          , M. de Rijke,
          <article-title>Time-sensitive query auto-completion</article-title>
          ,
          <source>in: Proceedings of SIGIR</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>593</fpage>
          -
          <lpage>602</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Lda-based document models for ad-hoc retrieval</article-title>
          ,
          <source>in: Proceedings of the 29th Annual International ACM SIGIR Conference</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>178</fpage>
          -
          <lpage>185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Contextual ranking features for web search</article-title>
          ,
          <source>in: Proceedings of the 2016 ACM SIGIR International Conference on the Theory of Information Retrieval</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>43</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Hambarde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Proenca</surname>
          </string-name>
          ,
          <article-title>Information retrieval: recent advances and beyond</article-title>
          ,
          <source>IEEE Access 11</source>
          (
          <year>2023</year>
          )
          <fpage>76581</fpage>
          -
          <lpage>76604</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Clavié</surname>
          </string-name>
          ,
          <article-title>rerankers: A lightweight python library to unify ranking methods</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2408.17344. arXiv:2408.17344.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Carpineto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Romano</surname>
          </string-name>
          ,
          <article-title>A survey of automatic query expansion in information retrieval</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>44</volume>
          (
          <year>2012</year>
          )
          <fpage>1</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Google</surname>
          </string-name>
          ,
          <source>Gemini Large Language Model</source>
          ,
          <year>2025</year>
          . URL: https://gemini.google.com,
          <source>generative AI model.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deveaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popel</surname>
          </string-name>
          , LongEval-Retrieval:
          <article-title>French-English dynamic test collection for continuous web search evaluation</article-title>
          ,
          <source>arXiv preprint arXiv:2303.03229</source>
          (
          <year>2023</year>
          ). URL: https://arxiv.org/pdf/2303.03229.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Kassab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>George</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Needell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Nia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Towards a fairer non-negative matrix factorization</article-title>
          ,
          <source>arXiv preprint arXiv:2411.09847</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2411.09847.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>scikit-learn Developers</surname>
          </string-name>
          ,
          <article-title>Non-negative matrix factorization (nmf) in scikit-learn</article-title>
          ,
          <year>2025</year>
          . URL: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html, accessed: 2025-05-11.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <article-title>Non-negative matrix factorization: Techniques and applications</article-title>
          ,
          <source>in: Advances in Data Analysis</source>
          , Springer,
          <year>2025</year>
          . URL: https://faculty.cc.gatech.edu/~hpark/papers/nmf_book_chapter.pdf, accessed:
          <fpage>2025</fpage>
          -05-11.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Luigi Developers</surname>
          </string-name>
          ,
          <article-title>Luigi: A python module for workflow management</article-title>
          ,
          <year>2025</year>
          . URL: https://luigi.readthedocs.io/en/stable/, accessed: 2025-05-10.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>SchedMD</surname>
          </string-name>
          , Slurm:
          <article-title>Simple linux utility for resource management</article-title>
          ,
          <year>2025</year>
          . URL: https://slurm.schedmd.com/documentation.html, accessed: 2025-05-10.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          , et al.,
          <source>Pyserini: A python toolkit for reproducible information retrieval research</source>
          ,
          <year>2025</year>
          . URL: https://github.com/castorini/pyserini, accessed: 2025-05-10.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>