<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>AGH IR at LongEval: Improving Scientific Information Retrieval with Dense Representations and Cross-Encoder Re-ranking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Stryszewski</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wiktor Prosowicz</string-name>
          <email>wprosowicz@student.agh.edu.pl</email>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomasz Kawiak</string-name>
          <email>tkawiak@student.agh.edu.pl</email>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrian Jaśkowiec</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <aff id="aff0">
          <label>1</label>
          <institution>Faculty of Electronics, Automation, Computer Science and Biomedical Engineering (EAIiIB), AGH University of Science and Technology in Cracow</institution>
          ,
          <addr-line>al. Mickiewicza 30, 30-059 Cracow</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a comprehensive investigation into scientific document retrieval strategies for the LongEval 2025 Task 2 challenge, focused on evolving open-access scholarly corpora from the CORE dataset. Our study benchmarks classical lexical retrieval, dense vector-based retrieval, and hybrid approaches, incorporating re-ranking via cross-encoders. Dense retrieval with cross-encoder re-ranking achieves the highest nDCG@10 score of 0.7448, significantly outperforming traditional baselines. We describe the preprocessing pipeline, model configurations, and experimental design, and provide a critical analysis of the performance and trade-offs among the tested retrieval strategies.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>Scientific Document Search</kwd>
        <kwd>Dense Retrieval</kwd>
        <kwd>Cross-Encoder Re-ranking</kwd>
        <kwd>Hybrid Search</kwd>
        <kwd>Semantic Embeddings</kwd>
        <kwd>Transformer Models</kwd>
        <kwd>Neural Ranking</kwd>
        <kwd>Approximate Nearest Neighbor (ANN)</kwd>
        <kwd>BM25 Baseline</kwd>
        <kwd>nDCG Evaluation</kwd>
        <kwd>Click-based Relevance</kwd>
        <kwd>CORE Dataset</kwd>
        <kwd>LongEval</kwd>
        <kwd>Reproducible IR Pipelines</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>Information retrieval (IR) in the scientific domain is a fundamental task, with significant implications
for accelerating research and discovery. As the volume of scientific publications continues to grow,
efficient and accurate search systems have become crucial for supporting researchers in finding relevant
information. Recent advances in deep learning have led to the development of powerful neural models
that offer improved retrieval effectiveness compared to traditional lexical methods. However, scientific
IR poses unique challenges, such as specialized vocabulary, domain-specific semantics, and the need for
high precision in retrieving relevant literature.</p>
      <p>Motivated by these challenges, we participated in the SciRetrieval subtask of the CLEF 2025 LongEval
Lab, which aims to longitudinally evaluate model performance over time. Our goal was to investigate
and compare the effectiveness of several modern retrieval approaches—including dense representations,
cross-encoder re-ranking, and hybrid models that combine multiple retrieval paradigms—for the task
of scientific document retrieval. This paper presents our methodology, experimental setup, and a
discussion of our findings in the context of current research trends.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related</title>
    </sec>
    <sec id="sec-3">
      <title>Work</title>
      <p>
        Scientific information retrieval has been the subject of extensive research, with both traditional and
neural methods explored for improving retrieval effectiveness. Early approaches relied on lexical
matching techniques, such as BM25 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which leverage term frequency and inverse document frequency
statistics. While effective, such models often struggle with semantic mismatches and vocabulary
variation.</p>
      <p>Recent advances in neural IR have introduced dense retrieval models, such as DPR [2] and
ColBERT [3], which encode queries and documents into dense vector representations, enabling efficient
semantic similarity search. Cross-encoder models, exemplified by monoBERT [4], further improve
relevance estimation by jointly encoding query-document pairs and directly modeling their interactions.
Hybrid approaches, combining lexical and neural signals [5], have demonstrated strong performance,
particularly in specialized domains such as scientific IR.</p>
      <p>In the context of scientific literature, benchmark datasets and shared tasks—such as TREC-COVID [6],
SciFact [7], and previous CLEF labs—have driven progress by providing evaluation frameworks and
encouraging the development of robust retrieval models. Our work builds upon these foundations,
applying state-of-the-art retrieval and re-ranking techniques to the SciRetrieval task.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Dense and Hybrid Retrieval Methods for Scientific Document</title>
    </sec>
    <sec id="sec-5">
      <title>Search</title>
      <sec id="sec-5-1">
        <title>3.1. Dense Retrieval with Embeddings (Bi-Encoders)</title>
        <p>In information retrieval, dense retrieval refers to methods that represent queries and documents as
low-dimensional vectors (embeddings) and use vector similarity for matching, as opposed to traditional
sparse methods such as TF-IDF or BM25 [2]. Dense retrievers typically use a bi-encoder (or dual-encoder)
architecture, where two neural networks (often Transformer-based, like BERT) independently encode
queries and documents. These embeddings are then compared using cosine similarity or dot product
[8].</p>
        <p>Bi-encoders are efficient since they allow offline pre-computation of document embeddings, enabling
rapid approximate nearest neighbor (ANN) search in large corpora. They are commonly trained
using contrastive learning techniques where relevant query–document pairs are brought closer in the
embedding space, while irrelevant pairs are pushed apart [9].</p>
        <p>One advantage of dense embeddings is their ability to bridge the lexical gap. They can match
semantically related phrases even when no words overlap—e.g., matching the query “Who is the bad
guy in Lord of the Rings?” to a passage mentioning “the villain Sauron”. Lexical retrieval would miss
this due to lack of term overlap [5].</p>
        <p>Dense retrieval systems are also adaptable via fine-tuning. For instance, in scientific search
applications, models can be trained on citation graphs or human judgments to embed semantically related
papers closer together [10].</p>
        <p>Foundational work in this area includes Dense Passage Retrieval (DPR) by [2], which demonstrated
that dense methods can outperform BM25 in open-domain QA. Similarly, Sentence-BERT (SBERT) [8]
adapted BERT into a Siamese architecture for efficient sentence embeddings, showing that large-scale
semantic search could be done orders of magnitude faster than with standard BERT-based re-rankers
while preserving accuracy.</p>
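        <p>To make the bi-encoder workflow concrete, the following minimal sketch uses the Sentence
Transformers library with the all-MiniLM-L6-v2 model that also appears later in this paper; the example
documents and query are illustrative and not drawn from the CORE corpus.</p>
        <preformat>
from sentence_transformers import SentenceTransformer, util

# Bi-encoder: queries and documents are encoded independently.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The villain Sauron forges the One Ring in Mordor.",
    "Photosynthesis converts light energy into chemical energy.",
]
# Document embeddings can be precomputed offline and indexed.
doc_emb = model.encode(docs, normalize_embeddings=True)

query_emb = model.encode("Who is the bad guy in Lord of the Rings?",
                         normalize_embeddings=True)

# With normalized embeddings, the dot product equals cosine similarity.
scores = util.dot_score(query_emb, doc_emb)
print(scores)  # the Sauron passage should score highest despite no term overlap
        </preformat>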
      </sec>
      <sec id="sec-5-2">
        <title>3.2. Theoretical Comparison of Bi-Encoders and Cross-Encoders</title>
        <p>Cross-encoders take a diferent approach to scoring document relevance: instead of encoding the query
and document independently, a cross-encoder feeds the concatenated query–document pair into a
Transformer and directly outputs a relevance score or classification [4, 11]. In this architecture (often
implemented with BERT or similar), the query and document tokens are processed together, allowing
the model’s self-attention to consider interactions between query terms and document terms at every
layer. This joint encoding enables the model to capture fine-grained matching signals and contextual
nuances that bi-encoder embeddings might miss, usually resulting in higher accuracy for relevance
estimation [4, 12].</p>
        <p>Because cross-encoders jointly encode each query–document pair, they are computationally expensive
for large-scale retrieval and are usually employed only at a re-ranking stage [13]. A cross-encoder does
not produce reusable document embeddings, meaning the model must recompute the full transformer
pass for every query and candidate document pair [8]. This is infeasible to do over an entire corpus of
millions of documents for each query. Instead, a common approach is a two-stage pipeline: first use a
fast retriever (e.g., BM25 or a dense bi-encoder) to fetch the top-k candidates for the query, then apply
a cross-encoder to re-score those candidates more accurately [12].</p>
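        <p>The two-stage pattern can be sketched with the Sentence Transformers library as follows. The
model identifiers match those used in our final system (Section 6), and the “query: ”/“passage: ”
prefixes follow the E5 family’s documented usage convention; the corpus and query strings are
placeholders, so this is an illustrative sketch rather than our exact submission code.</p>
        <preformat>
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("intfloat/e5-base-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

raw_docs = ["first document text ...", "second document text ..."]
corpus = ["passage: " + t for t in raw_docs]          # E5 passage prefix
corpus_emb = bi_encoder.encode(corpus, normalize_embeddings=True)

query = "what causes X?"
q_emb = bi_encoder.encode("query: " + query, normalize_embeddings=True)

# Stage 1: fast bi-encoder retrieval of top-k candidates.
hits = util.semantic_search(q_emb, corpus_emb, top_k=200)[0]

# Stage 2: the cross-encoder re-scores each (query, candidate) pair jointly.
pairs = [(query, raw_docs[h["corpus_id"]]) for h in hits]
rerank_scores = cross_encoder.predict(pairs)
reranked = sorted(zip(hits, rerank_scores), key=lambda x: x[1], reverse=True)
        </preformat>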
        <p>Empirically, cross-encoders (also called interaction-based models) consistently outperform
bi-encoders on ranking tasks, often by a large margin [11]. The Transformer’s self-attention can pick
up subtle relevance signals, phrase matches, and context dependencies that may be lost when queries
and documents are encoded independently. For example, a cross-encoder can learn that a document
sentence answers the query “what causes X?” even if it uses an alternate phrasing for the cause, by
attending to synonyms or related concepts in context.</p>
        <p>This rich interaction has led to state-of-the-art results on many benchmarks. For instance, [4] showed
that using a BERT cross-encoder to re-rank passages yielded a 27% relative improvement in MRR@10
on the MS MARCO passage ranking task. Similarly, cross-encoders were the top performers in the
TREC Deep Learning 2019 competition, dramatically outperforming traditional IR models in terms of
recall and nDCG.</p>
        <p>The drawback, however, is efficiency. Applying a large Transformer for every single document
scoring is extremely costly—roughly O(k) transformer evaluations for k candidates, versus O(1) for a
bi-encoder (after indexing)—making cross-encoders impractical as standalone retrieval methods for
large corpora. They also cannot pre-index documents or support standard inverted index lookup. Thus,
cross-encoders are usually reserved for re-ranking. In summary, bi-encoders provide the efficiency,
while cross-encoders provide deep interaction and accuracy. Combining both is often the most effective
strategy.</p>
      </sec>
      <sec id="sec-5-3">
        <title>3.3. Embedding-Based Similarity Search (Cosine Similarity and ANN)</title>
        <p>Once documents are represented as vectors, retrieving relevant documents reduces to a nearest-neighbor
search in the embedding space: given a query’s embedding, find the document embeddings that are
most similar. The similarity is typically measured by cosine similarity or, equivalently, by inner product
if embeddings are normalized [8].</p>
        <p>
          Given a query vector q and a document vector d, the cosine similarity is
cos(q, d) = (q ⋅ d) / (‖q‖ ‖d‖).
This metric yields a score in the range [−1, 1] and is a natural choice for ranking in dense retrieval
systems. Many models use Maximum Inner Product Search (MIPS) as an efficient proxy [14].
        </p>
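        <p>As a small numerical check (illustrative code, not part of the evaluated pipeline), the equivalence
between cosine similarity and the inner product of L2-normalized vectors can be verified directly:</p>
        <preformat>
import numpy as np

def cosine(q: np.ndarray, d: np.ndarray) -> float:
    """cos(q, d) = (q . d) / (||q|| * ||d||)."""
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

rng = np.random.default_rng(0)
q, d = rng.normal(size=384), rng.normal(size=384)  # e.g. MiniLM-sized vectors

# After L2 normalization, cosine similarity equals the plain inner product,
# which is why MIPS indexes can serve cosine-based retrieval.
qn, dn = q / np.linalg.norm(q), d / np.linalg.norm(d)
assert np.isclose(cosine(q, d), qn @ dn)
        </preformat>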
        <p>Since comparing a query against all document vectors is computationally intensive, Approximate
Nearest Neighbor (ANN) algorithms are employed to find the top-k similar vectors quickly [15]. These
algorithms reduce computation time by building specialized index structures, often at the cost of a minimal
loss in accuracy.</p>
        <p>ANN methods include:
• Clustering and Product Quantization: Grouping vectors into buckets to restrict the search
space.
• Locality-Sensitive Hashing (LSH): Mapping similar vectors into the same hash buckets [15].
• Graph-Based Search: Using data structures like Hierarchical Navigable Small World (HNSW)
graphs [16] to enable logarithmic-time search.</p>
        <p>The Faiss library [14] is a widely adopted framework that implements these techniques for large-scale
similarity search on CPUs and GPUs.</p>
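        <p>A minimal sketch of ANN search with Faiss follows, assuming the faiss-cpu package and random
vectors standing in for real document embeddings; the HNSW parameter M = 32 is an illustrative
default, not a tuned value from our experiments.</p>
        <preformat>
import faiss
import numpy as np

d = 384                                            # embedding dimensionality
xb = np.random.rand(100_000, d).astype("float32")  # "document" embeddings
xq = np.random.rand(1, d).astype("float32")        # one query embedding

# Normalize so that inner product equals cosine similarity.
faiss.normalize_L2(xb)
faiss.normalize_L2(xq)

# Graph-based HNSW index with an inner-product metric.
index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
index.add(xb)

scores, ids = index.search(xq, 10)  # approximate top-10 neighbors
        </preformat>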
        <p>In practical applications, these methods allow dense semantic search to operate interactively, returning
the most similar documents from millions in milliseconds. For instance, a query embedding can retrieve
semantically relevant abstracts from a scientific database using cosine similarity under the hood but
accelerated with ANN techniques.</p>
        <p>In summary, cosine similarity provides a principled way to compare dense text embeddings, while
ANN indexing enables efficient retrieval at scale, making them essential for modern scientific document
search engines.</p>
      </sec>
      <sec id="sec-5-4">
        <title>3.4. Hybrid Retrieval: Combining Lexical and Dense Methods</title>
        <p>Neither sparse lexical nor dense semantic retrieval alone is universally superior; each captures distinct
relevance signals that complement the other [5]. Hybrid retrieval systems combine lexical methods
(e.g., BM25) with dense embedding-based methods to exploit the strengths of both. A hybrid system
may combine scores or merge rank lists from sparse and dense retrieval pipelines [17].</p>
        <p>Lexical retrieval excels at exact term matching, especially for rare keywords, whereas dense retrieval
captures semantic similarity, paraphrases, and conceptual overlaps. This makes hybrid systems
particularly useful in addressing the lexical gap—cases where relevant documents do not share vocabulary
with the query [18].</p>
        <p>For example, the query “facebook change password” might miss a document titled “fb modify passwd”
under BM25, while a dense model could still retrieve it due to semantic similarity [18]. Conversely,
dense models can overlook exact matches if the vocabulary is rare or out-of-distribution. BM25 ensures
retrieval of such exact term matches.</p>
        <p>A simple yet effective hybrid approach is to take top-k results from both BM25 and a dense retriever
(e.g., DPR or SBERT), merge them, and re-rank by a weighted score. Even linear interpolation can yield
large improvements [17]. In BEIR benchmarks, such hybrids consistently outperform either method
alone, especially in zero-shot retrieval across domains [5].</p>
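        <p>A minimal sketch of score-level fusion by linear interpolation is shown below; the min-max
normalization and the interpolation weight alpha are common choices for illustration, not values
reported for our submitted runs.</p>
        <preformat>
def hybrid_scores(bm25_scores: dict, dense_scores: dict,
                  alpha: float = 0.5) -> dict:
    """Linearly interpolate min-max normalized BM25 and dense scores."""
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    b, d = normalize(bm25_scores), normalize(dense_scores)
    # A document missing from one list contributes 0 from that component.
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
            for doc in set(b) | set(d)}

# Example: merge top-k lists from BM25 and a dense retriever.
fused = hybrid_scores({"d1": 12.3, "d2": 8.1}, {"d1": 0.62, "d3": 0.71})
print(sorted(fused.items(), key=lambda x: x[1], reverse=True))
        </preformat>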
        <p>Designing optimal hybrids remains an open research problem. Dynamic weighting schemes,
machine-learned rank fusion, and diversity-optimized training objectives have been explored [19]. Some methods
train dense models explicitly to retrieve examples that BM25 misses, maximizing complementarity. In
scientific IR, hybrids match both formulaic terms (lexical) and semantic relevance (dense), resulting in
high coverage and accuracy across query types.</p>
        <p>In summary, hybrid retrieval brings together the precision of lexical search with the generalization
power of dense models. It remains a strong baseline for robust, high-quality retrieval in evolving
academic corpora.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. Experiments and Evaluation</title>
      <p>This section presents a detailed overview of our experimental design, evaluated configurations,
performance benchmarks, and insights gained throughout our participation in LongEval 2025 Task 2.
The LongEval 2025 Task 2 dataset originates from the CORE scholarly literature search engine
(https://core.ac.uk/). The dataset was constructed through a specialized pipeline that captured user-issued
queries, returned search results, and corresponding user interactions. It includes:
• Search Information: Unique session identifiers, search queries, and result lists.
• Click Information: Unique session identifiers, document links clicked in results, and their
ranks.</p>
      <p>Documents were sampled from actual user interactions and randomly selected from the CORE index.
The training data includes:
• 393 user queries.
• 4262 relevance assessments derived from click models.
• ~2 million scholarly documents (filtered to abstracts or full text).
• Collection period: November–December 2024.</p>
      <p>Folder Structure:
• documents/: JSONL files with article metadata and content.
• queries.txt: Tab-separated file with query IDs and text.
• qrels.txt: TREC-formatted relevance judgments derived from clicks.</p>
      <p>Document Format: each document includes fields such as:
id, title, abstract, authors, createdDate, doi, arxivId, pubmedId, magId,
oaiIds, links, publishedDate, updatedDate, fulltext
The “abstract only” version excludes the fulltext field.</p>
      <p>Relevance Judgments: qrels are generated using click models, offering soft supervision for ranking
models.</p>
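      <p>The collection can be loaded along the following lines (a sketch assuming qrels.txt uses the
standard four-column TREC format and that each line of a documents/ JSONL file holds one document
object; exact filenames and fields should be checked against the released data):</p>
      <preformat>
import json
from pathlib import Path

def load_queries(path: str) -> dict:
    """queries.txt: tab-separated query ID and text, one query per line."""
    with open(path, encoding="utf-8") as f:
        return dict(line.rstrip("\n").split("\t", 1) for line in f)

def load_qrels(path: str) -> dict:
    """qrels.txt in TREC format: qid, iteration, docid, relevance."""
    qrels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def iter_documents(doc_dir: str):
    """documents/: JSONL files with one JSON object (id, title, ...) per line."""
    for jsonl in Path(doc_dir).glob("*.jsonl"):
        with open(jsonl, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)
      </preformat>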
    </sec>
    <sec id="sec-7">
      <title>5. Initial Dense Baseline Setup</title>
      <p>Approach:
• Dataset: CORE corpus (~2M documents), filtered to ~4,262 based on QRELs.
• Retrieval Model: all-MiniLM-L6-v2 (Sentence Transformers).
• Query Set: 393 queries.
• Evaluation Metric: nDCG@10 using cosine similarity (a reference implementation is sketched after this list).</p>
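      <p>For reference, a standard nDCG@10 formulation consistent with this setup is sketched below
(illustrative code; the official evaluation may rely on a dedicated tool such as trec_eval):</p>
      <preformat>
import math

def ndcg_at_10(ranked_docs: list, qrels: dict) -> float:
    """nDCG@10 with graded relevance taken from click-model qrels."""
    gains = [qrels.get(doc, 0) for doc in ranked_docs[:10]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(qrels.values(), reverse=True)[:10]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the most relevant document (grade 3) is ranked second.
print(ndcg_at_10(["d1", "d2", "d3"], {"d2": 3, "d3": 1}))
      </preformat>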
      <sec id="sec-7-1">
        <title>5.1. Comparative Baseline Performance</title>
        <p>We benchmarked several baseline models on full and reduced document collections:
• BM25 (official baseline): nDCG@10 ~ 0.45
• Dense Retriever (MiniLM-L6-v2): nDCG@10 ~ 0.52
• Dense Retriever (E5-large-v2 + sampling): nDCG@10 ~ 0.42</p>
      </sec>
      <sec id="sec-7-2">
        <title>5.2. Advanced Dense Retrieval Experiments</title>
        <sec id="sec-7-2-1">
          <title>5.2.1. Scaling Up with E5-Large-v2</title>
          <p>Challenge: Full corpus encoding infeasible on local compute (~70 hours on RTX 2070).</p>
          <p>Solution: Filtered document set:
• All documents from QRELs (~4,000).
• 50K–200K random noisy documents.</p>
          <p>Result: Reduced diversity led to lower nDCG@10 (~0.42).</p>
        </sec>
      </sec>
      <sec id="sec-7-3">
        <title>5.3. Cross-Encoder Re-ranking</title>
        <p>We enhanced our pipeline with cross-encoder models that jointly encode (query, document) pairs.
Cross-encoders tested, each re-ranking the top 200 candidates:
• MiniLM-L-6-v2: nDCG@10 = 0.42
• Electra-base: nDCG@10 = 0.70
• TinyBERT-L-6-v2: nDCG@10 = 0.63
• MiniLM-L-12-v2: nDCG@10 = 0.7448</p>
      </sec>
      <sec id="sec-7-4">
        <title>5.4. Hybrid vs Dense-Only Pipelines</title>
        <sec id="sec-7-4-1">
          <title>We compared hybrid pipelines with purely dense + reranker setups.</title>
          <p>Pipeline
BM25 + scincl + MiniLM-L-12-v2
BM25 + BGE-base + BGE-reranker</p>
          <p>Dense Only: E5-base-v2 + MiniLM-L-12-v2
~ 0.28
~ 0.40
~ 0.68
5.5. System Implementation Details
• Preprocessing: title + abstract concatenation, lowercasing, prompt formatting.
• Embedding Storage: NumPy files to cache vectors.
• Query Evaluation: cosine similarity search, re-ranking by cross-encoder.
• TIRA Constraints: Ofline environment, cached models, Docker build.</p>
        </sec>
      </sec>
      <sec id="sec-7-5">
        <title>5.6. Final Observations and Takeaways</title>
        <p>• Dense + cross-encoder reranking outperforms all lexical and hybrid configurations.
• BM25 did not enhance retrieval when fused with dense pipelines.
• The best performance (nDCG@10 = 0.7448) came from E5 + MiniLM-L-12-v2 reranking.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>6. Submission Pipeline and Deployment</title>
      <sec id="sec-8-1">
        <title>6.1. System Packaging and Submission Strategy</title>
        <p>Our retrieval pipeline was designed to be reproducible and compatible with the TIRA evaluation
infrastructure. We containerized the entire system using Docker and provided an entry script that
manages the retrieval process end-to-end.</p>
      </sec>
      <sec id="sec-8-2">
        <title>6.2. Code Structure and Components</title>
        <p>• Retriever Model: intfloat/e5-base-v2
• Reranker Model: cross-encoder/ms-marco-MiniLM-L-12-v2
• Script: retrieve_pipeline.py handles document/query encoding, retrieval, reranking, and
output formatting.
• Execution: Controlled via a Bash script run_pipeline.sh.
• Dependencies: Defined in requirements.txt and installed in the Docker image.</p>
      </sec>
      <sec id="sec-8-2">
        <title>6.3. System Architecture and Reproducibility</title>
        <p>Our retrieval pipeline consists of two main stages: (1) initial retrieval using a dense bi-encoder model, and
(2) re-ranking of top candidates with a cross-encoder. For dense retrieval, we use the
intfloat/e5-base-v2 and all-MiniLM-L6-v2 models from the Sentence Transformers library. The cross-encoder stage
employs the cross-encoder/ms-marco-MiniLM-L-12-v2 model to refine the ranking of candidate
documents.</p>
        <p>Preprocessing involves concatenating the title and abstract fields, lowercasing text, and formatting
queries for optimal compatibility with transformer models. Embeddings are pre-computed and stored
for efficient similarity search. Cosine similarity is used to identify top-k candidates for each query,
which are then re-ranked using the cross-encoder.</p>
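        <p>The preprocessing and embedding-caching steps can be sketched as follows; the “query: ”/“passage: ”
prefixes follow the E5 family’s usage convention, and the helper names are ours for illustration rather
than the exact functions in retrieve_pipeline.py.</p>
        <preformat>
import numpy as np
from pathlib import Path
from sentence_transformers import SentenceTransformer

def build_passage(doc: dict) -> str:
    """Concatenate title and abstract, lowercase, add the E5 passage prefix."""
    text = f"{doc.get('title', '')} {doc.get('abstract', '')}".strip().lower()
    return "passage: " + text

def build_query(query_text: str) -> str:
    return "query: " + query_text.strip().lower()

def get_embeddings(model: SentenceTransformer, texts: list,
                   cache_path: str) -> np.ndarray:
    """Encode texts once and cache the matrix as a .npy file for reuse."""
    cache = Path(cache_path)
    if cache.exists():
        return np.load(cache)               # reuse previously computed vectors
    emb = model.encode(texts, normalize_embeddings=True, convert_to_numpy=True)
    np.save(cache, emb)
    return emb
        </preformat>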
        <p>To ensure reproducibility and compatibility with the TIRA evaluation infrastructure, the complete
pipeline was containerized, and all required models and dependencies were preloaded for offline
execution, as required by the shared task. The code and configuration are available at
https://github.com/yancostrishevsky/clef25-agh-cracow.</p>
      </sec>
      <sec id="sec-8-3">
        <title>6.4. Dense Retriever Models</title>
        <p>E5-base-v2 (intfloat/e5-base-v2): E5 [20] is a family of transformer-based models designed for
both passage and query embedding, specifically optimized for text retrieval tasks. The e5-base-v2
model is based on the BERT-base architecture and is trained using contrastive learning on large-scale
datasets that include queries, passages, and instructions. Unlike general-purpose encoders, E5 models
are fine-tuned for retrieval, making them effective in generating dense vector representations suitable
for Maximum Inner Product Search (MIPS). The dual-encoder setup allows for independent encoding of
queries and documents, enabling scalable retrieval over large corpora.</p>
        <p>MiniLM-L6-v2 (all-MiniLM-L6-v2): The all-MiniLM-L6-v2 model [21, 8] is a compact
transformer-based encoder from the Sentence Transformers library, based on the MiniLM architecture (6
layers). Despite its small size, MiniLM-L6-v2 achieves competitive results in semantic textual similarity
and retrieval benchmarks. It is trained using a siamese network setup to produce sentence embeddings,
making it efficient for large-scale dense retrieval. Its low computational footprint enables fast inference
and low-latency semantic search, which is crucial for interactive applications and resource-constrained
environments.</p>
        <p>Both models are employed in a bi-encoder framework, encoding queries and documents independently
into a shared vector space. This design facilitates efficient approximate nearest neighbor search to
retrieve relevant scientific documents given a query.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>7. Discussion</title>
      <p>The experimental results highlight the strengths and limitations of the evaluated retrieval approaches.
Dense retrieval models offer substantial improvements over purely lexical methods, particularly in
capturing semantic similarities between queries and documents. However, their effectiveness can be
limited by the quality and domain adaptation of the underlying embeddings.</p>
      <p>Cross-encoder re-ranking further boosts retrieval quality by allowing for fine-grained modeling
of query-document interactions, albeit at a higher computational cost. In our experiments, the dense
pipeline with cross-encoder re-ranking consistently achieved the best performance across evaluated
metrics, while fusing BM25 signals into the pipeline did not yield further gains. This suggests that deep
modeling of query-document interactions is particularly well suited to the challenges of scientific IR,
such as complex terminology and nuanced relevance criteria.</p>
      <p>Nevertheless, our analysis indicates that further gains may be possible through improved domain
adaptation, data augmentation, or ensembling strategies. The results also demonstrate the importance
of robust evaluation frameworks, as provided by the LongEval Lab, for tracking progress over time and
ensuring that retrieval models generalize to evolving scientific corpora.</p>
    </sec>
    <sec id="sec-10">
      <title>8. Conclusion</title>
      <p>In this paper, we presented our participation in the SciRetrieval subtask of the CLEF 2025 LongEval Lab,
evaluating dense retrieval, cross-encoder re-ranking, and hybrid approaches for scientific information
retrieval. Our findings demonstrate the effectiveness of dense retrieval combined with cross-encoder
re-ranking, particularly for complex domains such as scientific literature. We believe that future work
should focus on further domain adaptation, integration of external knowledge, and efficient scaling to
support the growing needs of the research community. For more details about the LongEval Lab and its
evaluation methodology, see the LongEval overview paper [22].</p>
    </sec>
    <sec id="sec-11">
      <title>AI Usage Statement</title>
      <p>Some parts of this paper were prepared and revised with the assistance of generative AI tools (OpenAI
ChatGPT), following the CEUR-WS GenAI Policy [23]. All content was critically reviewed and edited
by the authors.</p>
      <p>Acknowledgments. This work was supported by AGH University of Krakow.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          ,
          <article-title>The probabilistic relevance framework: Bm25 and beyond</article-title>
          ,
          <source>Foundations and Trends in Information Retrieval</source>
          <volume>3</volume>
          (
          <year>2009</year>
          )
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: EMNLP, 2020.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 39–48.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] R. Nogueira, K. Cho, Passage re-ranking with BERT, arXiv preprint arXiv:1901.04085 (2019).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, I. Gurevych, BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models, in: NeurIPS, 2021.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] E. M. Voorhees, W. R. Hersh, N. Goharian, J. R. Lo, D. Demner-Fushman, TREC-COVID: Constructing a pandemic information retrieval test collection, Journal of the American Medical Informatics Association 28 (2021) 677–685.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] D. Wadden, S. Wu, K. Lo, L. L. Wang, M. van Zuylen, A. Cohan, H. Hajishirzi, Fact or fiction: Verifying scientific claims, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 7534–7550.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: EMNLP, 2019.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] L. Gao, J. Callan, Condenser: A pre-training architecture for dense retrieval, arXiv preprint arXiv:2106.00240 (2021).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Cohan, S. Feldman, I. Beltagy, D. Downey, D. S. Weld, SPECTER: Document-level representation learning using citation-informed transformers, in: ACL, 2020.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] R. Nogueira, Z. Jiang, R. Pradeep, J. Lin, Document ranking with a pretrained sequence-to-sequence model, in: Findings of EMNLP, 2020.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J. Lin, X. Ma, P. Yang, Z. Dai, A. Yates, S. MacAvaney, C. Chen, Pyserini: An integrated toolkit for reproducible information retrieval research with sparse and dense representations, in: SIGIR, 2021.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] N. Reimers, I. Gurevych, Making monolingual sentence embeddings multilingual using knowledge distillation, arXiv preprint arXiv:2004.09813 (2020).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data (2019).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] A. Andoni, P. Indyk, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, in: FOCS, 2006.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE TPAMI (2018).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] S. Wang, S. Zhuang, G. Zuccon, BERT-based dense retrievers require interpolation with BM25 for effective passage retrieval, in: ICTIR, 2021.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] S. Kuzi, M. Zhang, C. Li, M. Bendersky, M. Najork, Leveraging semantic and lexical matching to improve the recall of document retrieval systems: A hybrid approach, arXiv preprint arXiv:2010.01195 (2020).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] S. Zhuang, S. Wang, G. Zuccon, Dense retrieval interpolation: An empirical study of the trade-offs between precision and recall in hybrid models, arXiv preprint arXiv:2108.08513 (2021).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] S. Wu, J. Liu, X. Ma, Y. Mao, J. Han, C. Yu, Text embeddings by weakly supervised contrastive pre-training, arXiv preprint arXiv:2212.03533 (2022).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, M. Zhou, MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, in: Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 5776–5788.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] M. Cancellieri, A. El-Ebshihy, T. Fink, P. Galuščáková, G. Gonzalez-Saez, L. Goeuriot, D. Iommi, J. Keller, P. Knoth, P. Mulhem, F. Piroi, D. Pride, P. Schaer, Overview of the CLEF 2025 LongEval Lab on Longitudinal Evaluation of Model Performance, in: J. Carrillo-de Albornoz, J. Gonzalo, L. Plaza, A. García Seco de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025), 2025.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] CEUR-WS GenAI Policy, https://ceur-ws.org/GenAI/Policy.html. Accessed: 2025-07-04.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>