<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AIRwaves at CheckThat! 2025: Retrieving Scientific Sources for Implicit Claims on Social Media with Dual Encoders and Neural Re-Ranking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cem Ashbaugh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leon Baumgärtner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Greß</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikita Sidorov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Werner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Wien</institution>
          ,
          <addr-line>Karlsplatz 13 A-1040 Wien</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Linking implicit scientific claims made on social media to their original publications is crucial for evidence-based fact-checking and scholarly discourse, yet it is hindered by lexical sparsity, very short queries, and domain-specific language. Team AIRwaves ranked second in Subtask 4b of the CLEF-2025 CheckThat! Lab with an evidence-retrieval approach that markedly outperforms the competition baseline. The optimized sparse-retrieval baseline (BM25) achieves MRR@5 = 0.5025 on the gold label blind test set. To surpass this baseline, a two-stage retrieval pipeline is introduced: (i) a first stage that uses a dual encoder based on E5-large, fine-tuned using in-batch and mined hard negatives and enhanced through chunked tokenization and rich document metadata; and (ii) a neural re-ranking stage using a SciBERT cross-encoder. Replacing purely lexical matching with neural representations lifts performance to MRR@5 = 0.6174, and the complete pipeline further improves to MRR@5 = 0.6828. The findings demonstrate that coupling dense retrieval with neural re-rankers delivers a powerful and efficient solution for tweet-to-study matching and provides a practical blueprint for future evidence-retrieval pipelines.</p>
      </abstract>
      <kwd-group>
        <kwd>Claim Source Retrieval</kwd>
        <kwd>Neural Representation Learning</kwd>
        <kwd>Neural Re-ranking</kwd>
        <kwd>Dual Encoder</kwd>
        <kwd>Cross Encoder</kwd>
        <kwd>BERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>• Domain specificity: Scientific terminology and varied writing styles demand models that
understand technical language in both social media and scholarly text.</p>
      <p>To guide the exploration of effective retrieval strategies for the task, the following research questions are formulated:
RQ1: Which document metadata is most useful to create embeddings that optimize retrieval performance for the task at hand?
RQ2: Does incorporating BM25-mined hard negative training examples improve query-to-document retrieval performance?
RQ3: To which degree does re-ranking retrieved results using cross-encoders improve the retrieval performance?</p>
      <p>The remainder of this paper is structured as follows. Section 2 provides an overview of prior approaches: from classical lexical ranking (BM25) and dual encoder retrieval (with in-batch and hard-negative sampling) to neural re-ranking methods (cross-encoders like SciBERT/MedBERT) and multi-stage pipelines that combine sparse and dense retrieval for improved semantic matching. Section 3 describes the data at hand. Section 4.2 introduces the two-stage pipeline, combining sparse/dense retrieval with neural re-ranking. Section 4.2.2 introduces neural re-ranking models and training strategies. Section 5 reports experiments on train, development, and test splits. Finally, Section 6 concludes and outlines future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>As the specific task of post-to-publication matching is not yet well-established in the literature, related areas of research can be considered. In a study with a similar dataset from the IRMiDis track at the Forum for Information Retrieval Evaluation 2022, the authors show that BERT significantly outperforms other algorithms in analysing and classifying COVID-19-related tweets, effectively capturing the underlying meaning of the text [5]. Another relevant approach is presented in the CO-Search system, which was developed to address COVID-19 information retrieval tasks using a multi-stage pipeline combining deep learning-based semantic retrieval (Siamese-BERT) with traditional keyword-based models (BM25), followed by a re-ranking step that integrates question answering and abstractive summarization modules [6].</p>
      <p>The similar task of linking tweets to news articles was introduced by Guo et al. [7], who proposed a Weighted Textual Matrix Factorization method for learning representations. Due to the fast advancement of deep neural approaches in representation learning, many promising methods have been suggested since. Reimers and Gurevych [8] provided foundational work in adapting the BERT architecture to a dual encoder framework well-suited for retrieval tasks, which is the basis of much of the subsequent literature. In the domain of post-to-news-article matching, Danovitch [9] proposes a custom-made dual encoder to minimize the distance between relevant post-article pairs in a joint embedding space while also employing a sequence-length-agnostic tokenization technique, which we adjust for use in our work. Piotrowski et al. [10] present a method differing from the one used in the present work by using separate encoders for posts (Twitter RoBERTa) and articles (RoBERTa), addressing the stylistic mismatch between social media posts with informal language and publications formulated in scientific language.</p>
      <p>Karpukhin et al. [11] evaluated the influence of using BM25-sampled hard negative documents (i.e. similar to the positive one for a given query) as additional training examples, demonstrating improvements over simple in-batch negatives. In-batch negatives are stated to be a training approach that efficiently uses negative documents from other queries within a batch instead of creating new ones, which is hypothesized to scale well with larger batch sizes [11]. Gillick et al. [12] use a dual encoder approach with in-batch and optional additional hard negative examples, demonstrating high effectiveness even without the typical second-stage re-ranking of results retrieved in the first stage.</p>
      <p>Neural re-ranking, in which deep language models reorder an initial candidate list, markedly improves retrieval effectiveness in fact-checking and question answering applications [13]. Mansour et al. [14] proposed a retrieval-based re-ranking pipeline that retrieves candidate claims from a corpus of fact-checked statements and re-ranks them with Sentence-BERT to mitigate the linguistic variability of social media content.</p>
      <p>MacAvaney et al. [15] proposed a two-stage retrieval pipeline for COVID-19 literature search, with an initial BM25 model that produces candidate papers and a subsequent SciBERT-based cross-encoder. Their study showed that adding a domain-specific transformer re-ranker substantially improved effectiveness, even under zero-shot conditions. In the present work, the same two-stage pattern was followed; however, the retrieval stage is dense rather than sparse.</p>
      <p>An even more domain-specific BERT-based language model was introduced by Vasantharajan et al.
[16]. MedBERT is pretrained on biomedical entity recognition datasets. By leveraging task-specific
pretraining, MedBERT was shown to improve downstream biomedical NLP tasks.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset and Experimental Setup</title>
      <p>The dataset comprises tweet-study pairs, sourced from the Altmetric corpus and the CORD-19 dataset [17]. Explicit references such as URLs were removed to simulate real-world scenarios of implicit scientific discourse. Each tweet in the dataset includes informal references to a scientific study, while each scientific paper includes metadata such as the title, abstract, authors, source, and journal. The final collection contains 15 699 tweet-study pairs, ensuring a diverse set of claims and associated scientific evidence. For experimentation, the original data split was used, with 12 853, 1 400 and 1 446 data points for the training, development and test sets, respectively. The document collection contains a total of 7 718 publications.</p>
      <p>Figure 1 compares the distribution of token lengths across the train, dev and test datasets. The histograms visualize the length of the queries in BERT tokens. While the train and dev sets exhibit remarkably similar distributions, with peaks occurring in comparable token length ranges, the test set shows a noticeable shift toward shorter queries. This difference is particularly evident in the higher frequency of queries with 10 to 30 tokens in the test set compared to the other two datasets. The test set's deviation may reflect inherent differences in data collection or curation processes.</p>
      <p>[Figure 1: Histograms of query lengths in BERT tokens for the train, dev, and test queries (x-axis: Tokens, y-axis: Frequency).]</p>
      <p>The consistent use of BERT tokenization across all datasets rules out tokenization artifacts as a potential cause for this variation, suggesting the differences stem from the underlying data composition itself. The observed length discrepancy between training and test queries should be accounted for during evaluation, as models may exhibit biased performance across different query lengths.</p>
      <p>All experiments were conducted on the Lightning AI platform1, utilizing NVIDIA H100 GPUs (80 GB memory, 26 CPUs, 1 513 TFLOPs) for resource-intensive tasks and T4 GPUs (16 GB memory, 8 CPUs, 125 TFLOPs) for standard workloads. For reproducibility, all environment settings such as Torch and CUDA were bound to a fixed random seed.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This section outlines the methodology used to address the scientific claim source retrieval task in the
CheckThat! CLEF Challenge 2025 and to answer the research questions guiding the present work. First,
the dataset used for the experiments is described. Second, sparse retrieval methods are introduced
to establish a baseline for comparison. Third, the proposed two-stage retrieval pipeline is outlined,
consisting of the phases neural representation learning and neural re-ranking. Finally, the evaluation
metrics applied to assess retrieval performance are presented.</p>
      <sec id="sec-4-1">
        <title>4.1. Sparse Retrieval: BM25</title>
        <p>BM25 is a probabilistic Okapi-style ranking function that scores a document d for a query q by combining term-frequency saturation, document-length normalization, and inverse document frequency [18, 19]. Equivalently, it can be written:
\[ \mathrm{score}(q, d) = \sum_{t \in q \cap d} \underbrace{\frac{f_{t,d}}{k_1\left((1 - b) + b\,\frac{|d|}{\mathrm{avgdl}}\right) + f_{t,d}}}_{\text{TF saturation + length norm.}} \times \underbrace{\log\left(\frac{|D| - n_t + 0.5}{n_t + 0.5}\right)}_{\text{RSJ IDF}} \tag{1} \]
where:
• f_{t,d}: frequency of term t in document d.
• |d|: length of d in tokens; avgdl: average document length in the collection.
• n_t: number of documents containing term t; |D|: total number of documents.
• k_1 > 0: term-frequency saturation parameter (larger k_1 → slower saturation).
• b ∈ [0, 1]: length-normalization parameter (b = 0 disables, b = 1 full normalization).</p>
        <p>Conceptually, BM25 refines classic TF–IDF by applying a probabilistic RSJ-style IDF weight and by
normalizing TF according to document length and saturation parameters [19]. Equation 1 fully specifies
the scoring function used in this work.</p>
        <p>Preprocessing and indexing. Titles and abstracts are concatenated and processed with spaCy, disabling the parser and NER to reduce memory consumption [20]. Tokens are lemmatized, lower-cased, filtered to retain only alphabetic or numeric items, and stop-words are removed using the 547-term list bundled with en_core_web_sm2. The resulting sequences are indexed with rank-bm25 v0.2.2's3 BM25 Okapi implementation. The default parameters (k_1, b) = (1.5, 0.75) were used for comparability with the off-the-shelf BM25 model. An in-memory dictionary caches the top-k lists for previously seen queries, eliminating redundant scoring during evaluation.
Query processing and retrieval. Tweets in the query set pass through the same preprocessing pipeline. The k = 10 highest-scoring papers are retrieved for each tweet and supplied as the candidate pool for downstream re-rankers. The identical cut-off (k = 10) is used when computing all sparse retrieval metrics reported in Section 5.1.
1Lightning AI. Lightning AI Platform. Available at: https://lightning.ai (accessed May 2025)
2https://spacy.io/models/en
3https://pypi.org/project/rank-bm25/</p>
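        <p>For illustration, a minimal sketch of this sparse baseline is given below, assuming a papers dictionary mapping document identifiers to title/abstract records; the variable names and the caching helper are illustrative rather than taken from the competition code.</p>
        <preformat>
# Minimal sketch of the BM25 baseline (spaCy preprocessing + rank-bm25); assumptions noted above.
import spacy
from rank_bm25 import BM25Okapi

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess(text):
    # Lemmatize, lower-case, keep alphabetic/numeric tokens, drop stop-words.
    doc = nlp(text)
    return [t.lemma_.lower() for t in doc if (t.is_alpha or t.is_digit) and not t.is_stop]

# papers: dict mapping doc id to {"title": ..., "abstract": ...} (assumed structure)
doc_ids = list(papers.keys())
corpus_tokens = [preprocess(papers[i]["title"] + " " + papers[i]["abstract"]) for i in doc_ids]
bm25 = BM25Okapi(corpus_tokens, k1=1.5, b=0.75)

cache = {}
def retrieve(tweet, k=10):
    # Return the k highest-scoring papers for a tweet, caching repeated queries.
    if tweet not in cache:
        scores = bm25.get_scores(preprocess(tweet))
        ranked = sorted(zip(doc_ids, scores), key=lambda x: x[1], reverse=True)
        cache[tweet] = [doc_id for doc_id, score in ranked[:k]]
    return cache[tweet]
</preformat>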
        <p>Rationale. Recent large-scale studies demonstrate that, despite advances in dense and hybrid retrieval,
BM25 remains a strong, reproducible baseline and often provides complementary or superior signals
on short-text tasks [21, 22]. Including it, therefore, furnishes a well-understood reference point and
facilitates comparability across systems in CLEF CheckThat! 2025 Task 4b.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Retrieval-Pipeline</title>
        <p>This section outlines the present work's experiments and their implementation specifics regarding the suggested two-stage retrieval pipeline.
4.2.1. Phase 1: Neural Representation Learning
In the following, the details of the experiment implementations for stage 1 are presented. These include the base setup, the document metadata experiments and the hard negative implementations within the neural representation learning approach. This section describes the methodology used to address research questions 1 and 2.</p>
        <p>Base Setup. Initially, a base setup was developed that allows fine-tuning models while keeping all other hyperparameters fixed to determine the most promising dual encoder architecture. Additionally, Word2Vec was explored with a different setup, as it is not a dual encoder architecture. The models considered for this trial are shown in Table 1. Word2Vec was applied to learn word embeddings from the corpus to evaluate its suitability for the retrieval task. Two settings were tested: training on the raw text without preprocessing and training on a spaCy-preprocessed corpus, where tokenization, lemmatization, lowercasing, and stopword removal were applied4.</p>
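        <p>A hedged sketch of the Word2Vec setting follows, assuming a tokenized_corpus list of token lists produced by the preprocessing described above; the vector size and window are illustrative choices, not values reported in this work.</p>
        <preformat>
# Word2Vec baseline sketch: train word vectors on the corpus and embed a text
# by averaging the vectors of its in-vocabulary tokens (illustrative hyperparameters).
import numpy as np
from gensim.models import Word2Vec

w2v = Word2Vec(sentences=tokenized_corpus, vector_size=300, window=5, min_count=2, workers=4)

def embed(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)
</preformat>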
        <p>For the base models all-MiniLM-L6-v25 (L6), multi-qa-mpnet-base-dot-v16 (MPNet), msmarco-bert-base-dot-v57 (MSMarco) and intfloat/e5-large-v28 (E5), the pre-trained implementations from the sentence-transformers library are applied.</p>
        <p>The rationale behind this choice is, for one, to evaluate a diverse set of model parameter sizes (see Table 1). L6 was chosen to evaluate the capacity of small models as well as for fast, initial prototyping to determine a suitable setup. All models are pre-trained for sentence- and paragraph-matching using different datasets. MPNet and MSMarco excel at multi-domain question-answer retrieval. E5 is the most general-purpose embedding model and the largest in terms of trainable parameters [23].
4https://spacy.io/usage/processing-pipelines
5https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
6https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1
7https://huggingface.co/sentence-transformers/msmarco-bert-base-dot-v5
8https://huggingface.co/intfloat/e5-large-v2</p>
        <p>The results for the base setup are shown in Table 5. To verify the influence of batch size on the in-batch negative training setting, the batch sizes [8, 16, 32, 64] were explored using E5. Other hyperparameter settings were chosen according to the literature and tests using the dev set. The best learning rate was determined to be 7 × 10⁻⁶. Given the relatively small amount of training data, 2 epochs were sufficient for optimal results, after which overfitting sets in. To prevent large destabilizing updates early in fine-tuning, the first 10% of all training steps were set as warm-up steps, during which the learning rate is linearly increased to the target value. The similarity of learnt document and query embeddings in the shared vector space is calculated using cosine similarity. The construction of input examples and a suitable loss function are crucial components of effective retrieval using dual encoders. Training examples are constructed as (query, positive document) pairs. This allows training on in-batch negatives, which has been found to be effective for similar retrieval tasks [12]. For each query, the positive documents of the other queries in the batch serve as negative examples. This can be effectively implemented using MultipleNegativesRankingLoss9. Therefore, the loss function follows formula 2:</p>
        <p>\[ \mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(\mathrm{scale} \cdot \mathrm{sim}(q_i, d_i))}{\sum_{j=1}^{B} \exp(\mathrm{scale} \cdot \mathrm{sim}(q_i, d_j))} \tag{2} \]
where B is the batch size, sim(·, ·) denotes the cosine similarity and scale is a temperature scaling parameter that is set to 20 in this work. q_i and d_i represent a query-positive document pair. The denominator sums over all positives in the batch, so all other positives serve as in-batch negatives for a given query.
9https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss</p>
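        <p>A minimal sketch of the base dual encoder fine-tuning is shown below, using the sentence-transformers fit API with MultipleNegativesRankingLoss; the train_pairs variable is an assumed data structure, while the learning rate, epoch count, warm-up fraction and scale follow the values stated above.</p>
        <preformat>
# Sketch: fine-tune E5 with in-batch negatives (formula 2) via MultipleNegativesRankingLoss.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("intfloat/e5-large-v2")

# train_pairs: list of (query_text, positive_document_text) tuples (assumed structure)
examples = [InputExample(texts=[q, d]) for q, d in train_pairs]
loader = DataLoader(examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model, scale=20)  # in-batch negatives, cosine similarity

total_steps = len(loader) * 2
model.fit(
    train_objectives=[(loader, loss)],
    epochs=2,
    warmup_steps=int(0.1 * total_steps),   # first 10% of all steps as linear warm-up
    optimizer_params={"lr": 7e-6},
)
</preformat>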
        <p>For the base setup exploring the effectiveness of the models in Table 1 (excluding Word2Vec), the following preprocessing is applied. Query and document texts are normalized, lower-cased and words are separated by single white spaces. To represent document information, title and abstract are chosen initially. Different document metadata are separated using an explicit [SEP] token. Queries are encoded based solely on the social media post's text. The base setup relies on simple truncation for query and document inputs. If either exceeds the model-intrinsic maximum sequence length (typically 512 tokens for transformer-based architectures), all tokens above the limit are cut off. Table 2 shows the statistics of model input sequence lengths for queries as well as the document lengths using different document metadata combinations.
Incorporating Document Metadata. To address truncation-induced information loss, the method of chunked tokenization is adapted from Danovitch [9] and combined with the mean- and max-pooling suggested by Lee, Gallagher and Tu [24] to combine the chunks meaningfully. The input sequences are divided into chunks of 510 tokens with an overlap of 50. Out of each chunk, one vector is created through mean-pooling. The individual chunks are then combined into a single fixed-size per-document vector using mean- and max-pooling. This embedding creation process for documents is applied during training as well as evaluation and follows formula 3,
\[ h_d = \frac{1}{2}\left( \frac{1}{C} \sum_{c=1}^{C} f(c) + \max_{c=1,\dots,C} f(c) \right) \tag{3} \]
where C is the number of chunks and f(c) the encoder output for chunk c. The left summand computes the mean-pooled embedding, the right the max-pooled embedding. The combination is meant to combine both the “typical” and “exceptional” content.</p>
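        <p>The chunked document embedding of formula 3 can be sketched as follows, assuming a SentenceTransformer-style encoder whose encode call already mean-pools within each chunk; decoding token chunks back to text before encoding is a simplification for illustration.</p>
        <preformat>
# Sketch of formula 3: split a long document into overlapping 510-token chunks,
# embed each chunk, and average the mean- and max-pooled chunk embeddings.
def chunk_tokens(token_ids, size=510, overlap=50):
    step = size - overlap
    return [token_ids[i:i + size] for i in range(0, max(len(token_ids), 1), step)]

def embed_document(model, text):
    ids = model.tokenizer(text, add_special_tokens=False)["input_ids"]
    chunk_texts = [model.tokenizer.decode(c) for c in chunk_tokens(ids)]
    chunk_vecs = model.encode(chunk_texts)                              # one vector per chunk
    return 0.5 * (chunk_vecs.mean(axis=0) + chunk_vecs.max(axis=0))     # h_d (formula 3)
</preformat>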
        <p>Using this setup, the best-performing model architecture identified in the base setup is used to determine the most informative document metadata. To control for the above-mentioned changes in modelling technique, the original Title + Abstract combination for document metadata was evaluated first, before exploring different combinations. The preprocessing applied depends on the metadata field: multiple authors and sources are split on ";", while the journal field contains single entries and requires no splitting. The used fields of a trial are then combined using [SEP] and normalized as before. The results obtained using chunked tokenization are shown in Table 7.</p>
        <p>Adding Hard Negatives. Karpukhin et al. [11] have shown that using one additional hard negative example per query can outperform training on in-batch negatives only. The rationale is to provide the model with examples that are hard to distinguish, forcing it to learn finer semantic patterns. Gillick et al. [12] suggest that this approach is particularly suitable for retrieval tasks where only one document is a correct query partner. This approach is adapted to match the task addressed in this work. BM25 is used to mine hard negative candidates per query. Training examples are now constructed as (query, positive document, hard negative document), where hard negative documents are selected from the top BM25 hits for the given query, excluding the positive document. These can be passed to MultipleNegativesRankingLoss, leading to a slightly adjusted formula 4 for the loss,
\[ \mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(\mathrm{scale} \cdot \mathrm{sim}(q_i, d_i))}{\sum_{j=1}^{N} \exp(\mathrm{scale} \cdot \mathrm{sim}(q_i, d_{i,j}))} \tag{4} \]
where N is the number of all in-batch positives plus any hard negatives passed and d_{i,j} is the j-th candidate document for the given query, either an in-batch negative or any included hard negative. In this experiment, exactly one hard negative example was added to each (query, positive document) pair.</p>
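        <p>The triplet construction with BM25-mined hard negatives can be sketched as follows; retrieve() refers to the BM25 helper sketched in Section 4.1, and doc_text() is a hypothetical helper returning the concatenated document fields.</p>
        <preformat>
# Sketch: build (query, positive, hard negative) training triplets from BM25 hits.
from sentence_transformers import InputExample

def mine_hard_negative(query_text, positive_id, k=10):
    # Take the highest-ranked BM25 hit that is not the gold document.
    for doc_id in retrieve(query_text, k=k):
        if doc_id != positive_id:
            return doc_id
    return None

triplets = []
for query_text, positive_id in train_queries:        # assumed (tweet, gold doc id) pairs
    neg_id = mine_hard_negative(query_text, positive_id)
    if neg_id is not None:
        triplets.append(InputExample(
            texts=[query_text, doc_text(positive_id), doc_text(neg_id)]))
# MultipleNegativesRankingLoss treats the third text as an extra hard negative (formula 4).
</preformat>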
        <p>Finally, a combination of the aforementioned hard negative and chunked tokenization approaches was applied. All hyperparameters were kept fixed and the document fields title, abstract, authors, journal and source were used to construct document representations. The results of applying these training approaches are displayed in Table 8.
4.2.2. Phase 2: Neural Re-Ranking
After the dual encoder stage had produced the initial candidate list of 100 candidate documents per query, the second-stage neural re-ranking module was applied to refine the results.</p>
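        <p>The first-stage candidate generation feeding the re-ranker can be sketched as follows, assuming the fine-tuned dual encoder from Phase 1 and lists doc_texts, doc_ids and tweets; util.semantic_search performs the cosine-similarity search over the embedded collection.</p>
        <preformat>
# Sketch: embed corpus and queries with the dual encoder and keep the top 100 candidates per query.
from sentence_transformers import util

doc_embeddings = model.encode(doc_texts, convert_to_tensor=True, normalize_embeddings=True)
query_embeddings = model.encode(tweets, convert_to_tensor=True, normalize_embeddings=True)

hits = util.semantic_search(query_embeddings, doc_embeddings, top_k=100)
candidates = [[doc_ids[h["corpus_id"]] for h in per_query] for per_query in hits]
</preformat>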
        <p>Cross-Encoder Transformer Rerankers. A cross-encoder jointly processes the query and candidate document, allowing full self-attention between the two texts. Three domain-specific BERT models were fine-tuned:
• DistilBERT [25], which is a distilled, lightweight transformer model optimized for computational efficiency.10
• SciBERT [26], a domain-specific transformer pretrained exclusively on scientific texts and therefore capturing scientific terminology and context.11
• MedBERT [16], a specialized transformer pretrained on biomedical literature and very suitable for capturing biomedical concepts and nuanced vocabulary.12</p>
        <p>The three BERT models were trained for three epochs, and hyperparameter tuning over the batch sizes and learning rates was performed on the development set. A learning rate of 2 × 10⁻⁵ and a batch size of 16 proved to be the most effective. Models were trained with a data collator with padding, a maximum sequence length of 512 tokens and a binary cross-entropy loss.</p>
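        <p>A hedged sketch of the cross-encoder re-ranker follows, using the sentence-transformers CrossEncoder wrapper as one convenient way to realise the setup described above (with num_labels=1 it trains with a binary cross-entropy objective); train_samples and doc_text() are assumed structures, and the hyperparameters mirror the values reported above.</p>
        <preformat>
# Sketch: fine-tune a SciBERT cross-encoder and re-rank the top candidates of a query.
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("allenai/scibert_scivocab_uncased", num_labels=1, max_length=512)

# train_samples: InputExample(texts=[tweet, document_text], label=1.0 or 0.0) (assumed)
reranker.fit(train_dataloader=DataLoader(train_samples, shuffle=True, batch_size=16),
             epochs=3, optimizer_params={"lr": 2e-5})

def rerank(tweet, candidate_ids, depth=10):
    pool = candidate_ids[:depth]
    scores = reranker.predict([(tweet, doc_text(d)) for d in pool])  # joint self-attention scoring
    order = sorted(zip(scores, pool), key=lambda x: x[0], reverse=True)
    return [doc_id for score, doc_id in order]
</preformat>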
        <p>Candidate-set size. Different approaches in the literature [27, 28] made clear that more candidate documents do not necessarily mean better performance. To examine the impact of this re-ranking depth, the cross-encoders were run with different numbers of candidate documents between 5 and 100.
Model Capacity. Table 3 summarizes the parameter counts of all re-ranking models investigated. This overview highlights the latency vs. capacity trade-offs inherent in choosing a re-ranker for scientific-claim retrieval.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Metrics and Significance</title>
        <p>Retrieval performance is assessed using MRR@1, MRR@5, MRR@10, Recall@5 and Recall@10. MRR@5 is the official evaluation metric for CheckThat! 2025 Task 4b (Scientific Claim Source Retrieval), as it balances emphasis on early relevant hits with resilience to occasional noisy rankings. MRR@k measures the average reciprocal rank of the single relevant document within the top k, but can be disproportionately influenced by a small number of queries whose relevant document ranks exceptionally high. Recall@k complements MRR by measuring the fraction of queries for which the relevant document appears within the top k, ensuring adequate candidate set coverage for downstream re-rankers.</p>
        <p>Let Q denote the set of queries and, for each q ∈ Q, let r_q represent the 1-based rank of the single relevant document (or +∞ if not retrieved). Define the indicator 1(r_q ≤ k), equal to 1 if r_q ≤ k and 0 otherwise. The metrics are:
\[ \mathrm{MRR@}k = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{r_q}\,\mathbb{1}(r_q \leq k), \quad k \in \{1, 5, 10\}, \tag{5} \]
\[ \mathrm{Recall@}k = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}(r_q \leq k), \quad k \in \{5, 10\}. \tag{6} \]
• Q: set of all queries.
• |Q|: total number of queries.
• r_q: 1-based rank of the relevant document for query q (or +∞ if not returned).
• k: cut-off depth (k ∈ {1, 5, 10} for MRR; {5, 10} for Recall).
• 1(r_q ≤ k): indicator function, 1 if the relevant document is in the top-k, 0 otherwise.
12https://huggingface.co/Charangan/MedBERT</p>
        <p>Higher MRR@k indicates that relevant documents tend to appear closer to the top (with positions
beyond k contributing zero), while higher Recall@k indicates that a larger fraction of queries retrieve
their relevant document within the top-k results.</p>
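        <p>Equations 5 and 6 translate directly into code; the sketch below assumes a list ranks holding, for each query, the 1-based rank of its relevant document (with a large sentinel value if it was not retrieved).</p>
        <preformat>
# Direct implementation of MRR@k (eq. 5) and Recall@k (eq. 6).
def mrr_at_k(ranks, k):
    return sum(1.0 / r for r in ranks if k >= r) / len(ranks)

def recall_at_k(ranks, k):
    return sum(1 for r in ranks if k >= r) / len(ranks)

# Example: ranks = [1, 3, 12] gives MRR@5 = (1 + 1/3 + 0) / 3 = 0.444... and Recall@5 = 2/3.
</preformat>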
        <p>To evaluate the significance of differences in test set performance relevant to the research questions, the Wilcoxon signed-rank test is applied for MRR@5 and McNemar's test for MRR@1. For research question 1, we compare the E5 predictions from the control trial (title, abstract) with the trial using all metadata fields (title, abstract, authors, journal, source). To address research question 2, the base setup E5 predictions are compared with the E5 trained using additional hard negative examples. For research question 3, we compare the base setup E5 predictions with the same predictions after re-ranking using SciBERT. Note that some violations of the Wilcoxon assumptions, particularly symmetry of the differences of per-query MRR@5 scores, cannot be completely ruled out.</p>
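        <p>The two significance tests can be sketched as below, assuming per-query reciprocal-rank lists for the two systems being compared and per-query top-1 correctness flags; the variable names are placeholders, not taken from the evaluation scripts.</p>
        <preformat>
# Sketch: Wilcoxon signed-rank test on paired per-query MRR@5 contributions and
# McNemar's exact test on the 2x2 top-1 correctness contingency table.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

stat, p_mrr5 = wilcoxon(rr_at_5_system_a, rr_at_5_system_b)

a = np.array(top1_correct_a, dtype=bool)
b = np.array(top1_correct_b, dtype=bool)
table = [[int(np.logical_and(a, b).sum()),  int(np.logical_and(a, ~b).sum())],
         [int(np.logical_and(~a, b).sum()), int(np.logical_and(~a, ~b).sum())]]
p_mrr1 = mcnemar(table, exact=True).pvalue
</preformat>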
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>This section presents the results and analyses of the experiments run in stage 1 and stage 2. It provides an overall comparison of the most promising methods applied. Most notably, different neural representation learning approaches, such as hard negative training regimes, and neural re-ranking are applied.</p>
      <sec id="sec-5-1">
        <title>5.1. Sparse Retrieval Evaluation</title>
        <p>Impact of preprocessing. Across all splits, the spaCy variant dominates the no-preprocessing run
on every metric. Focusing on the leaderboard metric MRR@5, lemmatization and stop-word removal
raise performance by</p>
        <p>These improvements confirm that normalizing inflectional variants and removing high-frequency
function words reduces term–document noise and helps lexical matching.</p>
        <p>Rank-depth trend. MRR increases from ranks @1 to @5, rising from +0.0429 to +0.0503 in the
raw variant and from +0.0507 to +0.0640 in the spaCy variant, but then increases by only +0.0044 to
+0.0054 in the raw variant and by +0.0062 to +0.0072 in the spaCy variant between ranks @5 and @10,
indicating that virtually all useful gains occur within the first five ranks.</p>
        <p>Recall increases from +0.0318 to +0.0393 for raw and from +0.0463 to +0.0526 for spaCy between @5 and @10, confirming that depth beyond five documents yields diminishing returns. Pre-processing therefore enhances retrieval exactly at the cut-off that determines the challenge ranking, while improvements at deeper ranks remain limited.</p>
        <p>Field-weighted variant. Metadata-aware extensions to BM25 have been shown to offer negligible benefits over the original formulation, with large-scale reproducibility studies reporting no significant differences in retrieval effectiveness [29]. Consequently, the standard BM25 configuration is adopted as the primary baseline.</p>
        <p>Generalization gap. Absolute scores drop from Train/Dev to Test under both configurations,
indicating a distribution shift between the public splits and the blind evaluation corpus. Nevertheless, the
relative advantage of the spaCy pipeline is preserved, suggesting that the linguistic normalization it
provides is robust to topic drift.</p>
        <p>Head-room for dense methods. Even with preprocessing, Recall@10 reaches only 0.6508 on Test, leaving substantial room for downstream neural re-rankers or hybrid retrieval to improve coverage. The sparse baseline therefore offers a realistic yet reproducible starting point for subsequent components in the pipeline, and the following evaluation sections will demonstrate how more advanced (dense and hybrid) approaches close this performance gap.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Retrieval Pipeline</title>
        <p>In the following paragraphs, the results of the neural representation learning experiments are presented and evaluated. Subsequently, the re-ranking of the best predictions generated in that step is evaluated.
5.2.1. Phase 1: Neural Representation Learning
Base setup results. Table 5 presents the retrieval performance of different models when trained using the base setup, including traditional Word2Vec embeddings and different dual encoder models, across the training, development, and test sets.</p>
        <p>For Word2Vec, the performance improved with spaCy preprocessing, as it did for the sparse retrieval model in Section 5.1. Using spaCy preprocessing for the transformer-based sentence embedding models was also tested. The tests showed, as Haviana et al. [30] also discuss, that preprocessing steps such as lemmatization and stopword removal do not improve but rather degrade performance for transformer-based sentence embedding models, because the models are then unable to capture the rich contextual information. Therefore, the results in the table show Word2Vec with spaCy preprocessing and the other models without spaCy preprocessing.</p>
        <p>It is evident that Word2Vec performs significantly worse than the transformer-based models. Even
without fine-tuning, E5 outperforms all other models except MPNet.</p>
        <p>Without fine-tuning, E5 results in an MRR@5 of 0.5183, while its fine-tuned counterpart results in an MRR@5 of 0.6018. This confirms that, in line with the findings of Rathinasamy et al. [31], fine-tuning the E5 model yields significantly greater retrieval performance. The fine-tuned E5 performs noticeably better than all other models, followed by MPNet. MSMarco and L6 achieve comparable scores despite both being pre-trained on general text sequence matching and the L6 model having far fewer trainable parameters, stressing the efficiency of the latter.</p>
        <p>Across models and metrics, there is a small gap in performance between the train and dev set, suggesting mild overfitting. The discrepancy between train and test set, however, is around 0.1 or more in all cases. This can be attributed to several possible reasons. One of these is a shift in the test data distribution, which was already hinted at in Figure 1.</p>
        <p>For the best-performing model E5, batch size experiment results for the test set are shown in Table 6,
where the setup corresponds to the base setup, only varying a single hyperparameter.</p>
        <p>The results suggest a slight upward trend, with all test metrics resulting from a batch size of 64 exceeding the batch size 8 metrics, while showing some fluctuations at intermediate batch sizes. This confirms the finding of Karpukhin et al. [11] that in-batch negative training settings benefit from larger batch sizes, as larger batches increase the ratio of negative to positive training examples for any given query.</p>
        <p>[Table 5: Base setup retrieval results (MRR and Recall@5/@10) for W2V, L6, MPNet, MSMarco and E5, with and without fine-tuning, on train, dev and test. Table 6: E5 test set results for batch sizes 8, 16, 32 and 64.]</p>
        <p>Incorporating Document Metadata. Table 7 shows the test set performance of E5 using chunked tokenization with max- and mean-pooling.</p>
        <p>Table 7: Test set performance of E5 using chunked tokenization with max- and mean-pooling for different document metadata combinations.
Fields | MRR@1 | MRR@5 | MRR@10 | Recall@5 | Recall@10
title+abstract (control) | 0.5166 | 0.6020 | 0.6102 | 0.7324 | 0.7960
title+abstract+authors | 0.5256 | 0.6089 | 0.6173 | 0.7351 | 0.8001
title+abstract+journal | 0.5194 | 0.6031 | 0.6117 | 0.7310 | 0.7974
title+abstract+source | 0.5207 | 0.6049 | 0.6127 | 0.7344 | 0.7960
all fields | 0.5228 | 0.6069 | 0.6152 | 0.7351 | 0.7994</p>
        <p>The control trial reveals that applying chunked tokenization and max- and mean-pooling to combine chunks yields no substantial change in test set performance measured by MRR@5 in comparison to the base setup E5 (Table 5) while using the same document fields. This finding is somewhat surprising, as Table 2 showed that some documents exceed the maximum sequence length, confirming that information is lost in the base setup without chunked tokenization. This might be attributed to only little meaningful abstract information being truncated in the base setup. For instance, since not all documents in the collection are necessarily mentioned in any queries, long documents might simply not be referenced often in our dataset.</p>
        <p>Adding any additional field individually provides an improvement of between 0.0011 and 0.0069 in MRR@5, with the authors field contributing the most information and journal the least. However, it can be concluded that title and abstract convey the most relevant information to match queries and documents, while author, journal and source information only improve performance marginally compared to the control trial. Finally, using all considered fields together improves test set MRR@5 by 0.0049. The control trial and the all-fields trial correctly rank roughly 51.7% and 52.3% of positive documents in first place, respectively, as measured by MRR@1. To determine the significance of the test set differences (control vs. all fields) in MRR@1 and MRR@5, McNemar's test and the Wilcoxon signed-rank test are applied, yielding p-values of 0.2430 and 0.0715, respectively. As both are larger than α = 0.05, we conclude that there is no evidence of a significant difference in retrieval performance when adding further document metadata in addition to title and abstract.</p>
        <p>Adding Hard Negatives. Table 8 shows the retrieval metrics for E5 when incorporating hard negative training examples as well as combining hard negative examples with additional document metadata. We found that adding one hard negative document per query generally results in stronger overfitting compared to using only in-batch negatives as in the base approach, as reflected in the relatively larger gap between train and dev scores for both E5+HN and E5+HN+All fields. Despite this, the test score measured by MRR@1 and MRR@5 is improved from 0.5166 and 0.6018 to 0.5353 and 0.6171, respectively, when providing one additional hard negative, compared to the base setup E5. To test the significance of the test set MRR@1 and MRR@5 performance gains between the base setup E5 and E5+HN, we apply the same tests as before. McNemar's test for MRR@1 and the Wilcoxon signed-rank test for MRR@5 yield p-values of 0.0159 and 0.0012, respectively, falling below α = 0.05 in both cases. We therefore conclude that applying the hard negative training regime in addition to in-batch negatives significantly improves retrieval performance.</p>
        <p>The positive document for a query is ranked roughly 1.2% more often in the top 5 predictions with E5+HN than in the base setup. By contrast, additionally providing the hard-negative setup with the additional document fields authors, journal and source (row E5+HN+All Fields) yields results comparable to E5+HN across the board.
5.2.2. Re-Ranking
The following paragraphs report the results of the implemented neural re-ranking experiments.
Comparison of Re-ranking Models. To quantify the influence of the cross-encoder on re-ranking quality, we applied three models (DistilBERT, SciBERT, and MedBERT) to the candidate lists generated by the baseline E5 dual encoder. Table 9 reports mean reciprocal rank at one and five (MRR@1, MRR@5) and recall at five (Recall@5) on the official train, development, and test partitions. All hyperparameters, including the batch size of 16 and the top-5 candidate depth, were held constant across runs.</p>
        <p>[Table 9: MRR@1, MRR@5 and Recall@5 of the SciBERT, DistilBERT and MedBERT re-rankers on the train, dev and test splits.]</p>
        <p>Across all splits, SciBERT achieved the strongest overall performance, reaching MRR@5 scores of 0.8338, 0.7070, and 0.6607 on the train, development, and test sets, respectively, with corresponding Recall@5 values of 0.8490, 0.7921, and 0.7317. DistilBERT yielded slightly lower effectiveness (MRR@5 = 0.7973/0.6954/0.6340; Recall@5 = 0.7615/0.6343/0.7317), consistent with its more compact architecture. MedBERT placed between the two (MRR@5 = 0.8178/0.6747/0.6381; Recall@5 = 0.8490/0.7921/0.7317). All models exhibited a monotonic decline from training to testing, indicating limited generalization beyond the training distribution. These findings underscore the benefit of domain-specific pre-training for scientific claim re-ranking while highlighting the trade-offs among model capacity, domain specialization, and robustness.</p>
        <p>Impact of Batch Size. To assess whether optimization stability affects re-ranking accuracy, we fine-tuned SciBERT with batch sizes of 8 and 16, keeping all other settings fixed. Table 10 shows that a batch size of 16 yields marginally higher MRR and recall on every split, suggesting that the larger mini-batch provides smoother gradient estimates and slightly better convergence. The gap, however, is small, and test-set performance remains stable, indicating that SciBERT generalizes well under either configuration.</p>
        <p>[Table 10: SciBERT re-ranking results for batch sizes 8 and 16 on the train, dev and test splits.]</p>
        <p>Impact of re-ranking depth. Having established the relative strengths of each model, we next investigate how the number of input candidates affects re-ranking performance for our best model (SciBERT). Table 11 reports MRR@1, MRR@5, and Recall@5 for k = 5, 10, and 20, highlighting the trade-off between richer candidate pools and the introduction of noise.</p>
        <p>The results in Table 11 reveal a trade-off between re-ranking depth and generalization. While increasing the candidate set from 5 to 20 leads to steadily higher MRR@1, MRR@5, and Recall@5 on the training split (e.g. MRR@5 rising from 0.8338 at depth 5 to 0.9031 at depth 20), the development and test splits exhibit a peak in performance at the intermediate depth of 10. Specifically, performance on the development set improves slightly when moving from depth 5 (MRR@5 = 0.7070) to depth 10 (MRR@5 = 0.7084) but then declines at depth 20 (MRR@5 = 0.6990). A similar pattern can be observed on the test split and in other metrics such as Recall@5.</p>
          <p>These findings suggest that while larger candidate pools allow the model to memorize more positives
during training, they also introduce greater noise and irrelevant documents. This behavior can be
explained by the fact that a relatively high proportion of the gold documents were already included in
the first 10 results of the previous dual-encoding step. Taking a depth that is higher than necessary can
harm re-ranker performance on unseen data. Due to these reasons, a moderate re-ranking depth of 10
was found to best balance training fit and generalization.</p>
          <p>E5+HN+All Dual-Encoder Candidates with SciBERT Re-ranking. Using the optimized dual
encoder retrieval stage, the top-5 and top-10 candidates were re-ranked with SciBERT as before. Table 12
reports the resulting metrics.</p>
        <p>[Table 12: Results of re-ranking the E5+HN+All dual encoder candidates with SciBERT at depths k = 5 and k = 10.]</p>
        <p>Finally, we combined the enhanced dual encoder configuration (E5+HN+All) with SciBERT re-ranking at depths k=5 and k=10 (Table 12). Under this setting, k=5 outperforms k=10 on the development split (MRR@5 = 0.7232 vs. 0.7055) and on the test split (MRR@5 = 0.6784 vs. 0.6559). The top-5 dense retrieval predictions therefore already cover the majority of relevant documents, and enlarging the pool primarily adds noise, degrading re-ranking accuracy. These results confirm that a shallow re-ranking depth suffices when the upstream retriever is strong.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Comparative Model Performance</title>
        <p>The best re-ranking stage was conducted using the SciBERT cross-encoder applied to the top-10
document-candidate set of the base setup dual encoder. This emerged as the best-performing model
in the re-ranking experiments. This model achieved a test-set MRR@5 of 0.6828 and an exact-match
MRR@1 of 0.6279, markedly outperforming the strongest dual encoder baseline—base setup E5—which
recorded MRR@5 = 0.6174 and MRR@1 = 0.5332 as shown in Table 13. It also reaches the highest
Recall@5 with 0.7656. The only measure in which E5+HN+All Fields performs better is Recall@10,
likely due to broader document field integration capturing more edge-case matches.</p>
        <p>[Table 13: Test set comparison of the best dual encoder runs and the SciBERT re-ranking pipeline.]</p>
        <p>To assess the reliability of these improvements, statistical tests were conducted on the per-query performance differences across all test queries. A Wilcoxon signed-rank test on the paired MRR@5 scores yielded a p-value of 1.9 × 10⁻²¹, confirming that the observed gains are statistically significant. Furthermore, McNemar's exact test on the top-1 correctness contingency table identified 231 queries that were only correct with SciBERT and 70 that were only correct with E5. This produced a p-value of 3.0 × 10⁻²¹, which validates that incorporating a SciBERT-based re-ranking stage delivers consistent, statistically significant enhancements over the dual encoder baseline.</p>
        <p>Figure 2 shows the distribution of the rank position at which the correct document was found by the developed information retrieval system. The x-axis represents the rank of the correct prediction within the top 10 results, with an additional bar labeled "11+" for cases where the correct document was not found in the top 10, and the y-axis shows the percentage of queries for which the correct article was found at each rank.</p>
        <p>Several patterns can be seen in the figure:
1. Improved Top-1 Placement: The dual encoder baseline places the correct document first
for approximately 42% of queries. Neural representation learning raises this rate to about 53%,
and SciBERT re-ranking further increases Top-1 accuracy to approximately 62%, confirming its
superior fine-grained discrimination.
2. Steeper Drop-off Beyond Rank 1: All systems exhibit a rapid decline after the first position.</p>
        <p>Both dense retrieval and re-ranking resolve a greater proportion of queries within ranks 2–5
compared to the baseline, demonstrating that semantic models push more correct documents
into the early ranks.
3. Reduction in “No-Hit” Cases: At rank 11+ (i.e., correct paper not in the top 10), the baseline
fails for about 35% of queries, whereas dense retrieval and re-ranking both reduce no-hit rates to
roughly 20%.</p>
        <p>Note that the SciBERT re-ranker was applied to only the top 10 candidate documents from the dense retrieval. Therefore, it promoted documents that were previously ranked lower within the first ten positions, improving the Top-1 accuracy while losing some accuracy at ranks 2-10 compared to the dense retrieval.</p>
        <p>Given the presented results, the research questions guiding the present work can now be answered:
RQ1: Which document metadata is most useful to create embeddings that optimize retrieval performance
for the task at hand?
Adding the authors field to title + abstract gives the largest standalone gain (MRR@5 rising from
0.6020 to 0.6089), whereas journal metadata contributes the least (MRR@5 to 0.6031) and source
falls in between (0.6049). Using all fields together yields a modest overall improvement (MRR@5
to 0.6069). However, using all fields does not reach statistical significance (McNemar’s p=0.2430,
Wilcoxon p=0.0715). Thus, while author information appears most informative, title and abstract
are the primary drivers of retrieval performance.</p>
        <p>RQ2: Does incorporating BM25-mined hard negative training examples improve query-to-document retrieval performance?
Incorporating one BM25-mined hard negative per query into the E5 training significantly boosts test-set retrieval: MRR@1 rises from 0.5166 to 0.5353 and MRR@5 from 0.6018 to 0.6171, with McNemar's (p = 0.0159) and Wilcoxon (p = 0.0012) tests confirming these gains are statistically significant.</p>
        <p>RQ3: To which degree does re-ranking retrieved results using cross-encoders improve the retrieval performance?
Table 13 and Figure 2 jointly answer this question. When the best dual encoder run (E5+HN+All Fields) is enhanced with the SciBERT cross-encoder applied to the top-10 dense candidates, MRR@5 improves from 0.6174 to 0.6828, a relative gain of +10.6%; a Wilcoxon signed-rank test and a McNemar's exact test confirm that the uplift is highly significant. The histogram in Figure 2 shows that re-ranking moves roughly an additional 10% of queries into the first position while also cutting the no-hit rate almost in half, compared to the baseline models. The cross-encoder re-ranking models investigated in this paper boost retrieval quality by 0.07 absolute MRR@5 points over an already strong dense retrieval model, with results that are both practically large and statistically significant.</p>
        <sec id="sec-5-3-1">
          <title>Baseline</title>
        </sec>
        <sec id="sec-5-3-2">
          <title>Neural Representation Learning Rerank 60 50</title>
          <p>)
%
(
s
ion 40
t
c
i
d
reP 30
t
c
e
r
r
o 20
C
10
0
1
2
3
4
5
6
7
8
9
10
11+</p>
        </sec>
        <sec id="sec-5-3-3">
          <title>Rank Position</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper presents the experiments conducted by the AIRwaves team for CheckThat! 2025 Task 4b:
Scientific Claim Source Retrieval. The approach explored a two-stage IR pipeline leveraging a dual
encoder retrieval, followed by neural re-ranking with cross-encoder models, particularly SciBERT,
DistilBERT and MedBERT.</p>
      <p>Several experiments were conducted with multiple retrieval models, dual encoder configurations (both with and without hard negatives), and varied document representations to address the unique challenges of mapping informal, concise social media posts to their corresponding scholarly sources. Notably, the system achieved a 2nd place ranking out of 31 participating teams on the official leaderboard, underscoring the robustness of the overall approach.</p>
      <p>Key findings of the work include that incorporating hard negatives into dual encoder training can improve retrieval performance significantly compared to using only in-batch negative document examples. Furthermore, re-ranking with cross-encoder models like SciBERT consistently boosts mean MRR and overall recall across all splits, highlighting their strength in capturing nuanced query-to-document relationships. Additionally, by experimenting with several different configurations, a practical guideline has been established, demonstrating what might or might not be promising to explore further.</p>
      <p>Perspectives for future work that build upon our results are manifold. If the goal is a single-stage
retrieval framework due to time and computation constraints that do not allow subsequent re-ranking,
there is room for improvement of the phase 1 dual encoder method shown in the present work. For
instance, we hypothesize that changes to the hard negative setup could yield further improvements. Our
implementation relies on statically mining BM25 hard negative examples per query, where examples
deemed “hard“ by a sparse model might not always be challenging for neural approaches capturing
nuanced semantics. A possible improvement might dynamically re-mine hard negative documents
using the current model weights after each epoch.</p>
      <p>In terms of neural re-ranking, generative methods with large models like MonoT5 were not explored.
We encourage future literature to build upon our work by evaluating what retrieval performance is
possible when using such models.</p>
      <p>The real-world usefulness of the developed approach depends on the specific use case of the system. After re-ranking, roughly 62.8% of posts can be matched to the correct publication right away, as shown by the SciBERT MRR@1. For a zero-shot system that relies on an instantaneous correct answer, this is likely insufficient. Given a query, the correct document, out of potentially thousands, will be contained in the first 10 suggestions roughly 79.5% of the time. If manual postprocessing of a few documents is an option, the developed system might therefore be tremendously helpful in narrowing down the search space.</p>
      <p>Overall, the study demonstrates that hybrid retrieval strategies, combined with careful sampling and
neural re-ranking, are powerful tools for linking social media claims to relevant scientific literature.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During preparation the authors followed the CEUR-WS activity taxonomy13 and used Grammarly for grammar and spelling checks, and ChatGPT (GPT-4o) for writing-style improvement, citation management, and formatting assistance. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the publication's content.
References
[1] F. Alam, J. M. Struß, T. Chakraborty, S. Dietze, S. Hafid, K. Korre, A. Muti, P. Nakov, F. Ruggeri, S. Schellhammer, V. Setty, M. Sundriyal, K. Todorov, V. Venktesh, The CLEF-2025 CheckThat! lab: Subjectivity, fact-checking, claim normalization, and retrieval, in: C. Hauff, C. Macdonald, D. Jannach, G. Kazai, F. M. Nardini, F. Pinelli, F. Silvestri, N. Tonellotto (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2025, pp. 467–478.
[2] G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 - Conference and Labs
of the Evaluation Forum, CLEF 2025, Madrid, Spain, 2025.
[3] F. Alam, J. M. Struß, T. Chakraborty, S. Dietze, S. Hafid, K. Korre, A. Muti, P. Nakov, F. Ruggeri,
S. Schellhammer, V. Setty, M. Sundriyal, K. Todorov, V. Venktesh, Overview of the CLEF-2025
CheckThat! Lab: Subjectivity, fact-checking, claim normalization, and retrieval, in: J. Carrillo-de
Albornoz, J. Gonzalo, L. Plaza, A. García Seco de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina,
G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction.</p>
      <p>Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025), 2025.
[4] S. Hafid, Y. S. Kartal, S. Schellhammer, K. Boland, D. Dimitrov, S. Bringay, K. Todorov, S. Dietze,</p>
      <p>Overview of the CLEF-2025 CheckThat! lab task 4 on scientific web discourse, in: [2], 2025.
[5] A. Bansal, R. Jain, J. Bedi, Detecting covid-19 vaccine stance and symptom reporting from tweets
using contextual embeddings., in: FIRE (Working Notes), 2022, pp. 361–368.
[6] A. Esteva, A. Kale, R. Paulus, K. Hashimoto, W. Yin, D. Radev, R. Socher, Covid-19
information retrieval with deep-learning based semantic search, question answering, and abstractive
summarization, NPJ digital medicine 4 (2021) 68.
[7] W. Guo, H. Li, H. Ji, M. Diab, Linking tweets to news: A framework to enrich short text data in social
media, in: H. Schuetze, P. Fung, M. Poesio (Eds.), Proceedings of the 51st Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational
Linguistics, Sofia, Bulgaria, 2013, pp. 239–249. URL: https://aclanthology.org/P13-1024/.
[8] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in:
K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019,
pp. 3982–3992. URL: https://aclanthology.org/D19-1410/. doi:10.18653/v1/D19-1410.
[9] J. Danovitch, Linking social media posts to news with siamese transformers, 2020.</p>
      <p>arXiv:2001.03303.
[10] J. Piotrowski, M. Wachnicki, M. Perlik, J. Podolak, G. Rucki, M. Brzozowski, P. Olejnik, J. Kozłowski,
T. Nocoń, J. Kozieł, S. Giziński, P. Sankowski, Contrastive news and social media linking using
bert for articles and tweets across dual platforms, 2023. URL: https://arxiv.org/abs/2312.07599.
arXiv:2312.07599.
[11] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense
passage retrieval for open-domain question answering, in: B. Webber, T. Cohn, Y. He, Y. Liu
(Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 6769–6781. URL:
https://aclanthology.org/2020.emnlp-main.550/. doi:10.18653/v1/2020.emnlp-main.550.
[12] D. Gillick, S. Kulkarni, L. Lansing, A. Presta, J. Baldridge, E. Ie, D. Garcia-Olano, Learning dense
representations for entity retrieval, in: M. Bansal, A. Villavicencio (Eds.), Proceedings of the 23rd
Conference on Computational Natural Language Learning (CoNLL), Association for Computational
Linguistics, Hong Kong, China, 2019, pp. 528–537. URL: https://aclanthology.org/K19-1049/. doi:10.
18653/v1/K19-1049.
[13] A. Pasin, N. Ferro, et al., Seupd@ clef: team axolotl on rumor verification using evidence from
authorities, Faggioli et al.[22] (2024).
[14] W. Mansour, T. Elsayed, A. Al-Ali, This is not new! spotting previously-verified claims over twitter,</p>
      <p>Information Processing &amp; Management 60 (2023) 103414. doi:10.1016/j.ipm.2023.103414.
[15] S. MacAvaney, A. Cohan, N. Goharian, Sledge-z: A zero-shot baseline for covid-19 literature
search, in: Proceedings of EMNLP, 2020, pp. 4171–4179.
[16] C. Vasantharajan, K. Z. Tun, H. Thi-Nga, S. Jain, T. Rong, C. E. Siong, Medbert: A pre-trained
language model for biomedical named entity recognition, in: 2022 Asia-Pacific Signal and
Information Processing Association Annual Summit and Conference (APSIPA ASC), 2022, pp. 1482–1488.
doi:10.23919/APSIPAASC55919.2022.9980157.
[17] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Burdick, D. Eide, K. Funk, Y. Katsis,
R. M. Kinney, Y. Li, Z. Liu, W. Merrill, P. Mooney, D. A. Murdick, D. Rishi, J. Sheehan, Z. Shen,
B. Stilson, A. D. Wade, K. Wang, N. X. R. Wang, C. Wilhelm, B. Xie, D. M. Raymond, D. S. Weld,
O. Etzioni, S. Kohlmeier, CORD-19: The COVID-19 open research dataset, in: K. Verspoor, K. B.
Cohen, M. Dredze, E. Ferrara, J. May, R. Munro, C. Paris, B. Wallace (Eds.), Proceedings of the 1st
Workshop on NLP for COVID-19 at ACL 2020, Association for Computational Linguistics, Online,
2020. URL: https://aclanthology.org/2020.nlpcovid19-acl.1/.
[18] S. Robertson, S. Walker, Some simple effective approximations to the 2–poisson model for
probabilistic weighted retrieval, in: Proc. 17th ACM Int. Conf. on Research and Development in
Information Retrieval (SIGIR ’94), 1994, pp. 232–241.
[19] S. Robertson, H. Zaragoza, The probabilistic relevance framework: Bm25 and beyond, Found.</p>
      <p>Trends Inf. Retr. 3 (2009) 333–389. URL: https://doi.org/10.1561/1500000019. doi:10.1561/
1500000019.
[20] M. Honnibal, I. Montani, S. V. Landeghem, A. Boyd, spacy: Industrial-strength natural language
processing in python, in: Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing: System Demonstrations, Association for Computational Linguistics, 2020,
pp. 1–7.
[21] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, I. Gurevych, BEIR: A heterogeneous benchmark for
zero-shot evaluation of information retrieval models, in: Proc. 43rd European Conf. on Information
Retrieval (ECIR 2021), 2021, pp. 3–20.
[22] C. Wu, R. Zhang, J. Guo, Y. Fan, X. Cheng, Are neural ranking models robust?, ACM Transactions
on Information Systems 41 (2022) 29:1–29:36.
[23] L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, F. Wei, Text
embeddings by weakly-supervised contrastive pre-training, 2024. URL: https://arxiv.org/abs/2212.03533.
arXiv:2212.03533.
[24] C.-Y. Lee, P. Gallagher, Z. Tu, Generalizing pooling functions in cnns: Mixed, gated, and tree,
IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (2018) 863–875. doi:10.1109/
TPAMI.2017.2703082.
[25] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of BERT: smaller, faster,
cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[26] I. Beltagy, K. Lo, A. Cohan, Scibert: A pretrained language model for scientific text, in: Proceedings
of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019, pp.
3615–3620.
[27] C. Xiong, Z. Liu, S. Sun, Z. Dai, K. Zhang, S. Yu, Z. Liu, H. Poon, J. Gao, P. Bennett, Cmt in
trec-covid round 2: Mitigating the generalization gaps from web to special domain search, 2020.</p>
      <p>URL: https://arxiv.org/abs/2011.01580. arXiv:2011.01580.
[28] S. Büttcher, C. L. A. Clarke, P. C. K. Yeung, Index pruning and result reranking: Effects on
ad-hoc retrieval and named page finding, in: Text Retrieval Conference, 2006. URL: https://api.
semanticscholar.org/CorpusID:8237854.
[29] C. Kamphuis, A. de Vries, L. Boytsov, J. Lin, Which BM25 Do You Mean? A Large-Scale
Reproducibility Study of Scoring Variants, 2020, pp. 28–34. doi:10.1007/978-3-030-45442-5_4.
[30] S. F. Chaerul Haviana, S. Mulyono, Badie'Ah, The effects of stopwords, stemming, and
lemmatization on pre-trained language models for text classification: A technical study, in: 2023 10th
International Conference on Electrical Engineering, Computer Science and Informatics (EECSI),
2023, pp. 521–527. doi:10.1109/EECSI59885.2023.10295797.
[31] K. Rathinasamy, J. Nettar, A. Kumar, V. Manchanda, A. Vijayakumar, A. Kataria, V. Manjunath,
C. GS, J. S. Sodhi, S. Shaikh, W. A. Khan, P. Singh, T. D. Ige, V. Tiwari, R. A. Mondal, H. K, S. Reka,
C. Amancharla, F. ur Rahman, H. P. A, I. Saha, B. Tiwary, N. S. Patel, P. T. S, B. A. J, Priyapravas,
M. R. Tarafdar, Enterpriseem: Fine-tuned embeddings for enterprise semantic search, 2024. URL:
https://arxiv.org/abs/2406.00010. arXiv:2406.00010.</p>
    </sec>
  </body>
</article>