<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>at CheckThat! 2025: Retrieve the Implicit - Scientific Evidence Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Moritz Staudinger</string-name>
          <email>moritz.staudinger@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alaa El-Ebshihy</string-name>
          <email>alaa.el-ebshihy@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wojciech Kusa</string-name>
          <email>wojciech.kusa@nask.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florina Piroi</string-name>
          <email>florina.piroi@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Allan Hanbury</string-name>
          <email>allan.hanbury@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>NASK National Research Institute</institution>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Research Studios Austria</institution>
          ,
          <addr-line>Data Science Studio</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Technische Universität Wien</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>With the rapid growth of scientific publications and their increasing circulation through social media, tracing claims back to their original sources has become increasingly difficult. The CheckThat! 2025 Claim Source Retrieval task addresses this challenge by requiring systems to retrieve the scientific paper implicitly referenced in a social media post. In this paper, we present our approach to solving this task. We explore a two-stage retrieval pipeline that combines lexical and dense retrievers with both pointwise and listwise re-rankers. We evaluate the impact of enriching document representations with full-text and summary content from CORD-19, and analyze trade-offs between model size, retrieval effectiveness, and runtime.</p>
      </abstract>
      <kwd-group>
        <kwd>Fact Checking</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Scientific Document Retrieval</kwd>
        <kwd>Listwise Ranking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the rapid growth of scientific publications and their circulation through social media, verifying a claim made
in a social media post often requires first retrieving the publication it refers to. Task 4b of the CheckThat! 2025 lab,
Scientific Claim Source Retrieval, focuses specifically on this retrieval step. The
Scientific Claim Source Retrieval task is defined as follows: given an implicit reference to a scientific
paper, specifically a social media post on Twitter that mentions a research publication without a URL,
the goal is to retrieve the mentioned paper from a pool of candidate scientific documents.</p>
      <p>In this work, we present our approaches to the CheckThat! Scientific Claim Source Retrieval
challenge at CLEF 2025 [3]. We evaluated the impact of fine-tuned pointwise ranker models, in contrast
to open-domain listwise ranker models. We further investigated whether the full-text or the summary of the
scientific evidence is beneficial for the ranking stage.</p>
      <p>The rest of this paper is structured as follows: in Section 2 we discuss previous approaches to scientific
claim retrieval. In Section 3 we discuss our approaches to retrieve the most relevant documents, while
Section 4 presents our results for the CheckThat! Lab. In Section 5 we discuss our findings, before we
conclude our work in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        The scientific fact-checking task is a variation of the general fact-checking process, focusing specifically
on assessing claims of scientific knowledge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. While the primary role of general fact-checking is
to detect misinformation, scientific fact-checking aims to verify research hypotheses and identify
supporting evidence within scientific work. It leverages evidence sources, often derived from scientific
publications, to validate claims.
      </p>
      <p>
        Generally, the framework of scientific fact-checking consists of three main components [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], namely:
(1) document retrieval, (2) evidence (rationale) selection, and (3) verdict prediction [4]. Document
retrieval involves identifying relevant documents from a scientific corpus that may contain evidence
related to the claim, using sparse or dense retrieval models. Evidence selection, then, extracts specific
sentences or passages from these documents that can serve as rationales to support or refute the
claim. Finally, verdict prediction uses the selected evidence to classify the claim’s veracity, typically into
categories such as Supported, Refuted, or Not Enough Information. CheckThat! Task 4b focuses on the
document retrieval component, where the aim is to retrieve the specific scientific publication implicitly
referenced in a tweet, from a pool of candidate papers. In the following, we review previous work on
document retrieval approaches relevant to this task, including both traditional and neural methods.
      </p>
      <p>
        Vladika et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] present a comprehensive overview of the scientific fact-checking task, covering
its main components, available datasets, and different approaches employed to address each stage of
the pipeline. In the context of document retrieval, the goal is to identify relevant documents from
a scientific corpus that may contain evidence related to a given claim. This component is typically
solved with Information Retrieval (IR) techniques, which are categorized into sparse and dense retrieval
approaches. Sparse retrieval approaches, such as TF-IDF and BM25, rely on keyword matching through
inverted indexes. On the other hand, dense retrieval approaches use dense vector representations of
queries and documents, enabling retrieval based on semantic similarity [5].
      </p>
      <p>One of the most commonly used datasets in scientific fact-checking, and the basis for many developed
models, is the SCIFACT dataset [6]. The document retrieval task involves retrieving relevant abstracts
from a corpus of around 5k scientific abstracts from the biomedical domain. The baseline model,
VeriSci, relied on simple TF-IDF scoring to retrieve the top-ranked abstracts. More advanced models, such as
VerT5erini and MultiVerS, adopted a two-stage retrieval strategy: first using BM25 to retrieve candidate
abstracts, followed by neural re-ranking with a T5-based model trained on the MS MARCO passage
dataset [7, 8]. Other approaches, including ParagraphJoint and ARSJoint, used dense retrieval with
BioSentVec embeddings [9], which were trained on a large corpus of biomedical texts. The SCIFACT
corpus was later expanded to 500k documents [10], resulting in a noticeable drop in overall fact-checking
performance. This highlights the limitations of existing retrieval methods and the need for more effective
semantic search techniques. The COVID-Fact dataset [11] addresses this challenge by retrieving snippets
from the top 10 results returned by the Google Search API for a given claim, mimicking a more realistic
evidence-gathering process.</p>
      <p>To the best of our knowledge, MultiVerS achieves state-of-the-art performance on the SCIFACT
dataset. Inspired by this approach, we use BM25 retrieval followed by T5-based re-ranking as a baseline in our
experiments.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>As previously discussed, retrieval approaches for fact-checking typically achieve the best performance
when following a two-stage strategy. This consists of an initial fast retrieval component that retrieves
candidate documents for a given claim, followed by a reranking component that produces a final ranked
list of documents.</p>
      <p>In our work, we employed several retrieval and ranking strategies and also varied the type of
information provided to our pipeline. The following subsections describe the individual components of
our system.</p>
      <sec id="sec-3-1">
        <title>3.1. Data</title>
        <sec id="sec-3-1-1">
          <title>3.1.1. Preprocessing</title>
        <p>To improve the retrieval performance of our lexical models, we applied a series of preprocessing steps to
both the input tweets and the scientific documents. Specifically, we removed URLs and mentions from
the tweets, and further cleaned the text by removing special characters (e.g., hashtags and punctuation),
digits, and non-ASCII characters. Afterward, we tokenized the text, removed stopwords, and applied
stemming to all tokens.</p>
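          <p>The paper does not spell out the exact implementation of this cleaning step; the following is a minimal Python sketch, assuming NLTK stopwords and Porter stemming (the regular expressions, tokenizer, and library choices are illustrative rather than our precise pipeline):</p>
          <preformat>
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text: str) -> list[str]:
    """Clean a tweet or abstract for lexical retrieval (illustrative sketch)."""
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"@\w+", " ", text)           # strip @mentions
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # drop hashtags, punctuation, digits, non-ASCII
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return [STEMMER.stem(t) for t in tokens]
          </preformat>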
        <p>We followed a similar preprocessing strategy for our embedding-based models, particularly for
removing URLs and special characters, as this was found to improve retrieval performance.</p>
      </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Data Enrichment</title>
          <p>The subset of the CORD-19 dataset provided by the lab organizers includes only the abstracts of scientific
papers. However, the full CORD-19 dataset also contains full-text information for many publications.
Since inspecting the full-text can be valuable for scientific fact-checking, we analyzed the impact
of incorporating additional information, either in the form of summaries or full-texts, into both the
retrieval and ranking stages.</p>
        <p>To this end, we utilized the full-text CORD-19 dataset (approximately 19GB in size) and linked its
full-text entries to the retrieval subset used in CheckThat! Task 4B using the cord_uid field.</p>
          <p>During this process, we found that only about 127,000 documents in the full CORD-19 dataset contain
full-text information, out of a total of roughly 369,000 documents. Coverage in the provided subset was similar,
though slightly higher: 2,759 out of 7,764 entries (approximately 35%) included a full-text
version.</p>
          <p>Despite this limitation, we aimed to test our hypothesis that incorporating either the full-text or a paper
summary, instead of the paper abstract, could improve retrieval performance. We used dense
representations of both abstracts and full-texts to extract the most semantically similar paragraphs and
generated summaries that were, on average, twice the length of the original abstracts. For generating
these summaries, we used the SentenceTransformer library with the all-MiniLM-L6-v2 model
(https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).</p>
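          <p>The selection procedure is not described in full detail; a plausible sketch, assuming cosine similarity between the abstract embedding and the full-text paragraph embeddings with a length budget of twice the abstract, is:</p>
          <preformat>
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def build_summary(abstract: str, paragraphs: list[str]) -> str:
    """Pick the full-text paragraphs most similar to the abstract until the
    summary reaches roughly twice the abstract's length (illustrative sketch)."""
    abstract_emb = model.encode(abstract, convert_to_tensor=True)
    paragraph_embs = model.encode(paragraphs, convert_to_tensor=True)
    scores = util.cos_sim(abstract_emb, paragraph_embs)[0]
    budget = 2 * len(abstract)
    selected, used = [], 0
    for idx in scores.argsort(descending=True):
        paragraph = paragraphs[int(idx)]
        if used + len(paragraph) > budget:
            continue  # skip paragraphs that would exceed the length budget
        selected.append(paragraph)
        used += len(paragraph)
    return " ".join(selected)
          </preformat>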
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Retrieval</title>
          <p>For the retrieval stage, we implemented one lexical model, BM25 (using the rank_bm25 implementation,
https://pypi.org/project/rank-bm25/), and two dense bi-encoder models:</p>
          <p>• gtr-t5-base (https://huggingface.co/sentence-transformers/gtr-t5-base)
• gtr-t5-large (https://huggingface.co/sentence-transformers/gtr-t5-large)</p>
          <p>We hypothesized that both dense models would outperform BM25, as fine-tuned dense retrieval models
generally achieve superior performance compared to lexical methods [12, 13]. In particular, we are
interested in analyzing how model size affects retrieval effectiveness in the context of fact-checking,
and whether improvements at the retrieval stage would also translate into gains during reranking.</p>
          <p>To test this, we fine-tuned both dense models on the provided CORD-19 subset using the
SentenceTransformer library. We employed the MultipleNegativesRankingLoss loss function, training over five
epochs with a learning rate of 2 ⋅ 10−5 on all available training queries.</p>
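          <p>A condensed sketch of this fine-tuning setup with the SentenceTransformer library (the batch size and the exact way tweets are paired with their relevant abstracts are our assumptions, not values reported above):</p>
          <preformat>
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/gtr-t5-base")  # analogously for gtr-t5-large

# Assumed training data: one (tweet, relevant abstract) pair per training query.
# MultipleNegativesRankingLoss supplies in-batch negatives implicitly.
train_pairs = [
    ("New study shows masks reduce transmission ...", "Abstract of the referenced paper ..."),
]
train_examples = [InputExample(texts=[tweet, abstract]) for tweet, abstract in train_pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_loader, loss)],
    epochs=5,
    optimizer_params={"lr": 2e-5},
)
          </preformat>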
          <p>In addition, we implemented an ensemble retrieval approach by combining gtr-t5-large with
BM25 using reciprocal rank fusion. Such hybrid strategies, combining lexical and dense retrieval, have
been shown to improve performance, as demonstrated by Althammer et al. [14] in the context of Dense
Passage Retrieval for legal texts.</p>
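          <p>Reciprocal rank fusion merges the two rankings by summing reciprocal-rank scores per document; a minimal sketch (the constant k = 60 is the commonly used default, not a value reported in this paper):</p>
          <preformat>
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids: each document scores
    sum(1 / (k + rank)) over the lists in which it appears."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = reciprocal_rank_fusion([bm25_ranking, gtr_t5_large_ranking])
          </preformat>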
        </sec>
      <sec id="sec-3-3">
        <title>3.3. Ranking</title>
          <p>For the reranking stage, we investigated the effectiveness of a smaller, domain-specific pointwise ranker
model, monot5-base-med-msmarco, in comparison to larger and more resource-intensive models.</p>
          <p>With the emergence of large language models (LLMs) since 2022 and the expansion of context
windows, listwise ranking strategies have gained popularity. These approaches typically outperform
pointwise and pairwise methods, as they consider a ranked list of documents at once, rather than
evaluating documents in isolation or in pairs. Pointwise rankers compute scores for each
query-document pair independently, so their computation time increases linearly with the number of candidates. In contrast,
listwise models (e.g., MXBAI) process an entire document list at once, yielding relatively stable
runtimes across different list sizes.</p>
          <p>Therefore, we included three additional reranking models in our experiments:
• mxbai-rerank-base-v2 (listwise)
• mxbai-rerank-large-v2 (listwise)
• monot5-3b-inpars-v2-trec_covid (pointwise, trained on CORD-19)</p>
          <p>Given that the CheckThat! Task 4 dataset is derived from the medical domain (CORD-19), we
prioritized medical-domain language models for the pointwise rerankers.</p>
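          <p>For reference, pointwise MonoT5 scoring follows the standard monoT5 recipe of Nogueira et al. [7]: each query-document pair is cast as a text-to-text prompt and scored by the probability of generating the token "true". A sketch with the Hugging Face transformers library (the model identifier and exact truncation length are our assumptions):</p>
          <preformat>
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("castorini/monot5-base-med-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-med-msmarco").eval()

def monot5_score(query: str, document: str) -> float:
    """Relevance of one query-document pair: probability of generating 'true'."""
    prompt = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    # One decoder step starting from the decoder start token is enough.
    decoder_input_ids = torch.full((1, 1), model.config.decoder_start_token_id, dtype=torch.long)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, 0]
    false_id = tokenizer.convert_tokens_to_ids("▁false")
    true_id = tokenizer.convert_tokens_to_ids("▁true")
    return torch.softmax(logits[[false_id, true_id]], dim=0)[1].item()
          </preformat>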
          <p>We fine-tuned both MonoT5 models—training the larger model for one epoch and the smaller one
for three epochs—as we observed no significant performance improvement with further training.
For supervision, we constructed training samples consisting of one relevant document and two hard
negatives sampled from the top-5 BM25 retrieval results.</p>
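          <p>A sketch of how such training samples could be assembled (the data structures, dictionaries keyed by query id, are our illustration of the procedure described above):</p>
          <preformat>
def build_monot5_training_samples(queries, qrels, bm25_rankings, n_negatives=2):
    """Pair each tweet with its single relevant document (label 1) and with
    hard negatives drawn from the top-5 BM25 results (label 0)."""
    samples = []
    for qid, tweet in queries.items():
        positive = qrels[qid]  # the one relevant cord_uid for this tweet
        samples.append((tweet, positive, 1))
        negatives = [doc for doc in bm25_rankings[qid][:5] if doc != positive]
        samples.extend((tweet, negative, 0) for negative in negatives[:n_negatives])
    return samples
          </preformat>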
        </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>The CheckThat! Lab [3] provided a subset of the CORD-19 collection as a retrieval collection, along
with a set of tweets and their corresponding relevance labels as queries.</p>
        <p>The document collection contains 7,764 distinct entries and 17 columns. These features include fields
such as title and abstract, as well as metadata fields. The metadata fields include: pmc_id, pubmed_id,
license, publish time, authors, journal, mag_id, arxiv_id, label, time, and a tweet id.</p>
        <p>We excluded these metadata fields, as they were less relevant for retrieval performance and did not
add additional information. We enriched this corpus by linking to the CORD-19 full-text dataset and
generating summaries based on this additional information (see Section 3.1.2 for details).</p>
        <p>The tweet dataset is divided into three splits: train, dev, and test. Each tweet is associated with
exactly one relevant scientific document. The train set contains 12,853 tweets, while the dev and test
sets contain 1,400 and 1,446 tweets, respectively.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Evaluation</title>
        <p>We evaluated our retrieval and reranking pipeline on the development and test sets of the shared task by
reranking the top 10, 20, and 50 retrieved candidates. Table 1 presents the results on the development set.
The initial retrieval stage achieved an MRR@10 of 0.6130 for BM25 and up to 0.6475 for GTR-T5-Large,
with only marginal gains beyond the top 10 positions.</p>
        <p>We observed a similar trend when applying reranking: most relevant documents were found within
the top 5 positions, and performance improved only slightly when extending the reranking depth. For
example, in the GTR-T5 + MonoT5 + Abstract setting, MRR increased by only 0.5% when reranking 10
documents and by approximately 1% at a depth of 50. This pattern aligns with the behavior of the MRR
metric, which is more sensitive to high-ranking positions.</p>
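        <p>For reference, with exactly one relevant document per tweet, MRR@k reduces to the average reciprocal rank of that document within the top k; a short sketch (the dictionary-based data layout is our illustration):</p>
        <preformat>
def mrr_at_k(rankings, qrels, k=5):
    """Mean reciprocal rank at cutoff k; each query has exactly one relevant document."""
    total = 0.0
    for qid, ranked_docs in rankings.items():
        relevant = qrels[qid]
        for rank, doc_id in enumerate(ranked_docs[:k], start=1):
            if doc_id == relevant:
                total += 1.0 / rank
                break  # contributes 0 if the relevant document is not in the top k
    return total / len(rankings)
        </preformat>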
        <p>We also analyzed the runtime efficiency of different retriever–ranker combinations based on the dev
split of the provided dataset, as shown in Table 2. For the runtime evaluation we ran our systems on a
server with 4 cores of an AMD EPYC 7413 CPU, an A40 GPU, and 64 GB of RAM.</p>
        <p>First-stage retrieval models were efficient, requiring between 0.009 and 0.045 seconds per query. In
contrast, rerankers varied substantially in speed, depending on model size and reranking depth. This is
expected: the MonoT5-med-base model has 220 million parameters, while MonoT5-med-3B has 3 billion.
The listwise rerankers (MXBAI) yielded stable runtime across different list sizes, as discussed before.</p>
        <p>Although runtime is not the primary concern in fact-checking—where accuracy is paramount—it is
still valuable to understand the trade-off between computational cost and retrieval quality.</p>
        <p>While the combination of GTR-T5-Large with MXBAI-large-v2 achieved the best performance
overall (MRR@5 = 0.7721), it required nearly twice the computation time compared to the smaller
MXBAI-base-v2, which still performed competitively. For this reason, we selected the smaller model for
our evaluation runs.</p>
        <p>We submitted eight runs for the leaderboard, summarized in Table 3. Our baseline run, combining
BM25 with the smaller MonoT5 model, achieved an MRR@5 of 0.6262. Subsequent runs aimed to
quantify how improvements in retrieval quality affected final ranking performance.</p>
        <p>Although dense retrievers outperformed BM25 by 0.02 to 0.035 on the development set, these gains
diminished after reranking, where differences across configurations fell below 0.01—an effect also visible
in the evaluation of Schlatt et al. [15].</p>
        <p>Using full-text and summary inputs during ranking led to decreased MRR compared to using
abstracts alone. However, performance improved when using the larger MonoT5 model, with a gain of
approximately 0.015 on the test set when reranking 50 documents. This matches our observations on
the development set, where the larger model consistently outperformed the smaller one by 0.01 to
0.03. Our best-performing configuration used the listwise reranker MXBAI-base-v2, which achieved an
MRR@5 of 0.6568.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In this section, we discuss the findings from the results of our experiments.</p>
      <p>Effect of size of reranked candidates We observed that performance changed only slightly or remained
constant while increasing the reranking depth. This shows that most relevant documents are found
within the top 5 to 10 candidates. This suggests that limiting reranking to a smaller candidate set may
be sufficient for effective performance, enabling more efficient use of computational resources.</p>
      <p>Model size vs. time We also found that model size affects retrieval effectiveness, in terms of both
performance and runtime. Larger rerankers, such as MonoT5-med-3B and MXBAI-large-v2, consistently
outperformed smaller models. However, these gains come at the cost of increased runtime and memory
usage.</p>
      <p>Pointwise vs. Listwise rerankers Our comparison shows that, although the listwise reranking
approaches outperform the pointwise alternatives, this comes at the cost of increased reranking time. However,
they are especially suitable for reranking small candidate sets, given their increased performance on
sets of size 10.</p>
      <p>Full-text and Summary vs. Abstract retrieval Our results indicate that enriching the retrieval
corpus with full-text content or extracted summaries does not necessarily improve performance over
using abstracts alone. In some cases, these additional inputs even introduced noise and reduced
ranking quality. This suggests that naive enrichment strategies may not yield benefits and that future
work should explore more targeted methods, such as using section-aware embeddings or fine-tuned
summarization models.</p>
      <p>Relevant Documents After failing to improve the ranking performance beyond an MRR@5 of 78%
on the dev dataset, we began investigating potential issues with the reranking process.</p>
      <p>[Figure 1: Rank distribution of relevant documents retrieved by BM25 on the dev set (x-axis: rank position; y-axis: number of relevant documents).]</p>
      <p>To analyze retrieval performance, we plotted the rank distribution of relevant documents across
all queries. As shown in Figure 1, BM25 already ranks many relevant documents highly: 772 are
ranked first, and 1174 out of 1400 relevant documents appear within the top 50—the subset considered
for reranking. However, this also means that the remaining 226 relevant documents fall outside the
reranking scope (i.e., beyond rank 50) and thus cannot be influenced by the reranker. Additionally, a
small portion of relevant documents are not retrieved at all within the top 1000 results.
</p>
      <p>[Figure 2: Rank distribution of relevant documents retrieved by GTR-T5-Large on the dev set (x-axis: rank position; y-axis: number of relevant documents).]</p>
      <p>In contrast to BM25, Figure 2 illustrates the performance of our best dense retriever, GTR-T5-Large.
Here, 1264 out of 1400 relevant documents are retrieved within the top 50, indicating improved coverage
within the reranking window.</p>
      <p>[Figure 3: Rank distribution of relevant documents after reranking the top-50 candidates (x-axis: rank position; y-axis: number of relevant documents).]</p>
      <p>Figure 3 demonstrates that our reranker improved the rankings for many relevant documents: 1014
are now ranked first, and 87 are ranked second. Still, 100 relevant documents are absent from the top
50 entirely.</p>
      <p>This analysis highlights an opportunity to focus on improving the retrieval of long-tail research
documents, which could further enhance overall ranking performance.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Summary and Future Work</title>
      <p>In this work, we presented our approach and results for the CheckThat! Claim Retrieval Shared
Task. We evaluated the impact of three different first-stage retrieval algorithms and four different
ranking models. Additionally, we examined whether augmenting the input with full-text or summary
information could improve performance. Contrary to our expectations, we found that this additional
information negatively affected retrieval effectiveness, but this could be due to missing data and needs
further exploration.</p>
      <p>Furthermore, our results show that open-domain listwise rerankers consistently outperform
domain-specific pointwise rerankers, even those with significantly larger model sizes. This suggests that ranking
strategy plays a more crucial role than model size or domain specialization in this task.</p>
      <p>For future work, we propose exploring the use of an additional retrieval or classification layer
applied to the top five retrieved documents. This could help refine the final ranking and improve
overall performance, by improving the ranking of the around 200 documents which are in the top 5
but not ranked first. Such an approach would allow the use of larger models and more detailed input
representations in a computationally feasible way.</p>
      <p>A more important objective would be to improve the retrieval of the long-tail of research documents,
which are retrieved and ranked by our first-stage retrieval engine between rank 50 and rank 300.
Therefore, it is important to investigate why these documents are not ranked higher in the first-stage.</p>
      <p>Another promising direction would be to construct or extend datasets that include full-text versions of
the cited scientific papers. This would enable a more systematic evaluation of how different information
granularities (abstracts, summaries, and full-texts) impact both retrieval and ranking performance.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work is supported by the ANR Kodicare bi-lateral project, grant ANR-19-CE23-0029 of the French
Agence Nationale de la Recherche, and by the Austrian Science Fund (FWF, grant I4471-N).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4o in order to: check grammar and spelling.
After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take
full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[3] S. Hafid, Y. S. Kartal, S. Schellhammer, K. Boland, D. Dimitrov, S. Bringay, K. Todorov, S. Dietze,
Overview of the CLEF-2025 CheckThat! lab task 4 on scientific web discourse, ????</p>
      <p>[4] X. Zeng, A. S. Abumansour, A. Zubiaga, Automated fact-checking: A survey, Language and
Linguistics Compass 15 (2021) e12438.</p>
      <p>[5] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage
retrieval for open-domain question answering, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.),
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP),
Association for Computational Linguistics, Online, 2020, pp. 6769–6781.
URL: https://aclanthology.org/2020.emnlp-main.550/. doi:10.18653/v1/2020.emnlp-main.550.</p>
      <p>[6] D. Wadden, S. Lin, K. Lo, L. L. Wang, M. van Zuylen, A. Cohan, H. Hajishirzi, Fact or fiction:
Verifying scientific claims, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the
2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association
for Computational Linguistics, Online, 2020, pp. 7534–7550.
URL: https://aclanthology.org/2020.emnlp-main.609/. doi:10.18653/v1/2020.emnlp-main.609.</p>
      <p>[7] R. Nogueira, Z. Jiang, R. Pradeep, J. Lin, Document ranking with a pretrained sequence-to-sequence
model, in: T. Cohn, Y. He, Y. Liu (Eds.), Findings of the Association for Computational Linguistics:
EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 708–718.
URL: https://aclanthology.org/2020.findings-emnlp.63/. doi:10.18653/v1/2020.findings-emnlp.63.</p>
      <p>[8] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, MS MARCO: A human
generated machine reading comprehension dataset (2016).
URL: https://www.microsoft.com/en-us/research/publication/ms-marco-human-generated-machine-reading-comprehension-dataset/.</p>
      <p>[9] Q. Chen, Y. Peng, Z. Lu, BioSentVec: creating sentence embeddings for biomedical texts, in:
2019 IEEE International Conference on Healthcare Informatics (ICHI), IEEE Computer Society, Los
Alamitos, CA, USA, 2019, pp. 1–5.
URL: https://doi.ieeecomputersociety.org/10.1109/ICHI.2019.8904728. doi:10.1109/ICHI.2019.8904728.</p>
      <p>[10] D. Wadden, K. Lo, B. Kuehl, A. Cohan, I. Beltagy, L. L. Wang, H. Hajishirzi, SciFact-open: Towards
open-domain scientific claim verification, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Findings
of the Association for Computational Linguistics: EMNLP 2022, Association for Computational
Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 4719–4734.
URL: https://aclanthology.org/2022.findings-emnlp.347/. doi:10.18653/v1/2022.findings-emnlp.347.</p>
      <p>[11] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. Kinney, Z. Liu, W. Merrill,
P. Mooney, D. A. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. D. Wade, K. Wang, C. Wilhelm,
B. Xie, D. Raymond, D. S. Weld, O. Etzioni, S. Kohlmeier, CORD-19: The COVID-19 open research
dataset, CoRR abs/2004.10706 (2020). URL: https://arxiv.org/abs/2004.10706. arXiv:2004.10706.</p>
      <p>[12] J. Ni, C. Qu, J. Lu, Z. Dai, G. Hernandez Abrego, J. Ma, V. Zhao, Y. Luan, K. Hall, M.-W. Chang,
Y. Yang, Large dual encoders are generalizable retrievers, in: Y. Goldberg, Z. Kozareva, Y. Zhang
(Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,
Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 9844–9855.
URL: https://aclanthology.org/2022.emnlp-main.669/. doi:10.18653/v1/2022.emnlp-main.669.</p>
      <p>[13] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late
interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR Conference on Research
and Development in Information Retrieval, SIGIR ’20, Association for Computing Machinery,
New York, NY, USA, 2020, pp. 39–48.
URL: https://doi.org/10.1145/3397271.3401075. doi:10.1145/3397271.3401075.</p>
      <p>[14] S. Althammer, A. Askari, S. Verberne, A. Hanbury, DoSSIER@COLIEE 2021: Leveraging dense retrieval
and summarization-based re-ranking for case law retrieval, CoRR abs/2108.03937 (2021).
URL: https://arxiv.org/abs/2108.03937. arXiv:2108.03937.</p>
      <p>[15] F. Schlatt, M. Fröbe, H. Scells, S. Zhuang, B. Koopman, G. Zuccon, B. Stein, M. Potthast, M. Hagen,
Set-encoder: Permutation-invariant inter-passage attention for listwise passage re-ranking
with cross-encoders, in: C. Hauff, C. Macdonald, D. Jannach, G. Kazai, F. M. Nardini, F. Pinelli,
F. Silvestri, N. Tonellotto (Eds.), Advances in Information Retrieval, Springer Nature Switzerland,
Cham, 2025, pp. 1–19.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] M. Staudinger, A. El-Ebshihy, A. M. Ningtyas, F. Piroi, A. Hanbury, AMATU@SimpleText2024: are LLMs any good for scientific leaderboard extraction, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS, Online, 2024.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] J. Vladika, F. Matthes, Scientific fact-checking: A survey of resources and approaches, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 6215–6230. URL: https://aclanthology.org/2023.findings-acl.387/. doi:10.18653/v1/2023.findings-acl.387.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>