<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>VerbaNexAI at CheckThat! 2025: Fine-Tuning DeBERTa for Multi-Label Scientific Discourse Detection in Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mervin Jesus Sosa Borrero</string-name>
          <email>sosam@utb.edu.co</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jairo Enrique Serrano Castañeda</string-name>
          <email>jserrano@utb.edu.co</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan Carlos Martinez Santos</string-name>
          <email>jcmartinezs@utb.edu.co</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edwin Alexander Puertas Del Castillo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Tecnológica de Bolívar, School of Digital Transformation, Cartagena de Indias</institution>
          ,
          <addr-line>130010</addr-line>
          ,
          <country country="CO">Colombia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents VerbaNexAI's submission to Task 4a of the CheckThat! 2025 Lab, which focuses on the identification of scientific discourse in English-language tweets. We propose a multi-label classification approach based on a fine-tuned DeBERTa-v3 model, optimized through stratified cross-validation, threshold calibration using precision-recall curves, and ensemble prediction with soft-voting. Our system ranked 2nd overall on the official leaderboard with a macro-averaged F1 score of 0.7983 and achieved the top F1 score (0.8133) in Category 1 (scientific claims), demonstrating strong performance in detecting verifiable assertions in noisy social media contexts. To address class imbalance and label sparsity, we employed class-specific weighting and threshold tuning strategies. The final system combines predictions from multiple folds and loss configurations, resulting in robust generalization. Code, models, and evaluation scripts are publicly available to promote reproducibility and further research on trustworthy scientific information detection in social platforms.</p>
      </abstract>
      <kwd-group>
        <kwd>Scientific Discourse Detection</kwd>
        <kwd>Multi-label Classification</kwd>
        <kwd>Transformer Models</kwd>
        <kwd>Twitter Data</kwd>
        <kwd>Threshold Optimization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, the widespread use of social media platforms has transformed the way scientific
information is produced, shared, and consumed. Platforms such as Twitter play a central role in shaping
public discourse on critical scientific issues, including public health, climate change, and technology
policy. However, this rapid dissemination of information also amplifies challenges around the reliability,
credibility, and traceability of scientific claims online. In response, the computational community
has increasingly focused on developing systems to automatically detect, verify, and classify scientific
discourse across social platforms.</p>
      <p>
        The CLEF CheckThat! Lab [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has emerged as a pivotal benchmark for evaluating such systems.
In particular, Task 4a of the CheckThat! 2025 Lab [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] addresses the detection of scientific discourse
in tweets, grounded in an annotation framework established in prior work such as SciTweets. This
framework introduces a nuanced categorization of science-related content, distinguishing between:
(1) scientifically verifiable claims, (2) references to scientific knowledge, and (3) mentions of scientific
research or context. This task is a multi-label classification challenge complicated by linguistic variability,
domain ambiguity, and class imbalance—factors that mirror the real-world complexity of scientific
communication.
      </p>
      <p>Based on this foundation, we present a multi-stage pipeline designed to detect scientific discourse
in tweets. Our approach integrates transformer-based architectures, threshold calibration techniques,
and ensemble learning strategies to maximize performance across all three categories. Specifically,
we fine-tuned the microsoft/deberta-v3-base model using stratified 5-fold cross-validation, optimizing
hyperparameters such as learning rate and training epochs. To handle label imbalance, we introduced
class-specific weighting using the BCEWithLogitsLoss function and applied precision-recall-based
threshold tuning per label. We produced final predictions through a soft-voting ensemble of the
top-performing models. Unlike previous approaches that relied on heuristic sampling or simpler classifiers,
our system emphasizes careful optimization and generalization.</p>
      <p>We trained solely on the official ct-train.tsv dataset, without external corpora or retrieval
augmentation, thereby validating the system's robustness in a constrained but realistic setting. Importantly, we did not
employ additional pretraining, zero-shot methods, or rule-based heuristics; this highlights the strength of
focused fine-tuning and calibration.</p>
      <p>Our system achieved competitive results in the official evaluation of Task 4a. It ranked 2nd overall in
macro-averaged F1-score (0.7983) across all categories. Notably, it obtained the highest F1-score (0.8133)
in Category 1, which focuses on scientifically verifiable claims—arguably the most impactful category
for downstream fact-checking and credibility analysis. These results demonstrate that our approach
not only generalizes well but also excels in the most challenging and socially relevant facet of the task.
This paper documents our system, from data preprocessing and model selection to experimental design
and result analysis. We also reflect on the implications of our design choices and propose avenues for
future research, including multilingual adaptation, external knowledge integration, and explainability
in scientific discourse classification.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Tasks and Objectives</title>
      <p>The primary objective of our participation in Task 4a of the CheckThat! 2025 Lab was to design a
robust and reproducible system capable of accurately identifying different types of scientific discourse
in tweets. This involved addressing the multi-label nature of the task and optimizing performance across
all three predefined categories: scientifically verifiable claims (C1), references to scientific knowledge
(C2), and mentions of scientific research or context (C3). Table 1 summarizes these three categories.</p>
      <p>We formulated the classification problem as a multi-label task, where each tweet may belong to zero,
one, two, or all three categories simultaneously. The dataset, provided in tabular format with tweets and
corresponding binary label vectors (e.g., [0.0, 1.0, 0.0]), reflects the challenges inherent in analyzing social
media content, including informal language, abbreviations, hashtags, hyperlinks, emojis, and occasional
sarcasm or irony. A significant class imbalance further complicates the task, with the majority of tweets
falling into the no-category case ([0.0, 0.0, 0.0]). We evaluated performance using the macro-averaged
F1-score across the three categories.</p>
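      <p>To make the data format concrete, the following minimal sketch loads the training file and inspects the label distribution; the column names ("text", "labels") are assumptions, since the exact ct-train.tsv header is not reproduced here.</p>
      <preformat>
# Minimal sketch; column names "text" and "labels" are assumptions.
import ast
import pandas as pd

df = pd.read_csv("ct-train.tsv", sep="\t")
# Parse the string-encoded binary vectors, e.g. "[0.0, 1.0, 0.0]"
df["labels"] = df["labels"].apply(ast.literal_eval)

# Inspect the label distribution, including the dominant no-category
# case [0.0, 0.0, 0.0] that drives the class imbalance described above.
print(df["labels"].apply(tuple).value_counts())
      </preformat>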
      <sec id="sec-2-1">
        <title>2.1. Main Objectives of Experiments</title>
        <p>Our primary goal was to develop a scalable system that achieves high predictive performance across all
three categories while ensuring consistency and reproducibility. Table 2 summarizes the guiding design
principles that structured our approach. Through these objectives, we aim to demonstrate that careful
model tuning and architectural design can yield high-quality results in challenging multi-label settings,
even in the absence of auxiliary data or external knowledge sources.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We built our system for Task 4a of CheckThat! 2025 upon a transformer-based architecture designed
for multi-label classification of scientific discourse in tweets. The overall methodology consisted of
four key stages: data preprocessing, model fine-tuning, threshold calibration, and ensemble integration.
All experiments were conducted exclusively on the English-language training set provided by the
organizers, without incorporating external corpora, retrieval systems, or manually crafted rules. Our
pipeline is illustrated in Figure 1, which outlines the main processing stages.</p>
      <sec id="sec-3-1">
        <title>3.1. Model Architectures</title>
        <p>
          We selected microsoft/deberta-v3-base as the backbone model for our system based on its
empirical performance and architectural advantages over other widely used transformers such as BERT,
RoBERTa, and XLNet. DeBERTa (Decoding-enhanced BERT with disentangled attention) improves upon
BERT-based models by decoupling positional and content-based attention, enabling the model to capture
longer-range dependencies and nuanced phrase structures more effectively. This is particularly beneficial
for scientific discourse, where key information may appear in complex or indirect expressions [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>Compared to RoBERTa, which enhances BERT through more robust pretraining, DeBERTa additionally
integrates a relative position bias and an enhanced mask decoder, which improve fine-tuning performance
in sentence-level classification tasks. XLNet, while powerful in autoregressive settings, introduces
increased complexity and instability when fine-tuned for multi-label classification of short-form noisy text such as tweets.</p>
        <p>Figure 1 summarizes the pipeline: raw tweet text (input) passes through the DeBERTa tokenizer and preprocessing, then the fine-tuned DeBERTa-v3 model, followed by per-label precision-recall threshold tuning and a soft-voting ensemble that yields the predicted multi-label output.</p>
        <p>We conducted preliminary ablation experiments with RoBERTa-base and BERT-base on a validation
fold. We observed that DeBERTa-v3 consistently achieved 2–3 F1 points higher on Category 1 (scientific
claims) and showed better calibration across all thresholds. For these reasons, we adopted
DeBERTa-v3-base as the most appropriate foundation for our multi-label tweet classifier.</p>
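        <p>As a reference point, a multi-label classification head over DeBERTa-v3 can be instantiated as in the following sketch; this uses the standard Hugging Face interface rather than our exact training code, and the problem_type setting simply selects sigmoid outputs with a BCE loss.</p>
        <preformat>
# Sketch of instantiating the backbone; not our exact training code.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=3,                               # C1, C2, C3
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)
        </preformat>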
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Preprocessing</title>
        <p>The input data consists of tweets accompanied by binary vectors that indicate the presence or absence
of each of the three target categories. We tokenized the tweets using the Hugging Face implementation
of the DeBERTa tokenizer, with a maximum sequence length of 128 tokens. Special tokens, hashtags,
emojis, and URLs were preserved during tokenization, as these often provide important context in
scientific discourse on social media. No additional text normalization or truncation was applied beyond
the tokenizer’s built-in handling.</p>
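        <p>The tokenization step described above corresponds to the following sketch; the example tweet is illustrative.</p>
        <preformat>
# Tokenization sketch: maximum length 128, no extra normalization, so
# hashtags, emojis, and URLs pass through the tokenizer unchanged.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
batch = tokenizer(
    ["New study shows #coffee may lower risk https://t.co/xyz"],
    max_length=128,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([1, 128])
        </preformat>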
        <p>To optimize classification performance, we followed a multi-stage training and ensembling strategy
summarized in Algorithm 1. This pipeline involves training two model variants separately: one using
the BCEWithLogitsLoss function with class weights (via the pos_weight parameter) to mitigate
label imbalance, and another using the same loss function without weights. Both variants are trained
using 5-fold stratified cross-validation, and their best-performing checkpoints (based on macro F1-score)
are retained. Final predictions are obtained by applying soft-voting over the outputs of the top models
from each variant.</p>
        <p>Algorithm 1 Multi-Stage Fine-Tuning Strategy for Multi-Label Tweet Classification. The pipeline
includes two model variants trained independently—with and without class weights—and combines
their predictions via soft-voting.</p>
        <p>1: Initialize DeBERTa-v3-base model with pretrained weights.
2: Set training hyperparameters: learning rate, epochs, batch size.
3: Tokenize tweets using DeBERTa tokenizer (max 128 tokens).
4: Prepare optimizer (AdamW) and loss function (BCEWithLogitsLoss).
5: for each fold in 5-fold cross-validation do
6: Train on four folds and validate on the fifth.
7: Tune thresholds using precision-recall curve.
8: Save the best model based on macro F1.
9: end for
10: Load top 2 models (with and without class weights).
11: Compute soft-voting over prediction probabilities.
12: Return final predictions.</p>
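        <p>A runnable sketch of the fold construction in steps 5–9 is shown below. The specific stratifier is an assumption: the text says only "stratified cross-validation", and MultilabelStratifiedKFold from the iterative-stratification package is one standard choice for multi-label data.</p>
        <preformat>
# Fold construction sketch; MultilabelStratifiedKFold is an assumption.
import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=(100, 3))  # toy label matrix for C1..C3
X = np.zeros((len(y), 1))              # placeholder features; only y matters

skf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Train on four folds, validate on the fifth (step 6); label
    # proportions in each fold approximate those of the full dataset.
    print(fold, y[train_idx].mean(axis=0), y[val_idx].mean(axis=0))
        </preformat>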
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Training Configuration</title>
        <p>
          We trained the model using the BCEWithLogitsLoss function [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] to accommodate the multi-label
nature of the task. To address class imbalance—particularly the prevalence of negative-only examples
([0, 0, 0])—we applied label-specific positive class weighting via the pos_weight parameter, computed
from the training label distribution. We conducted training and evaluation using stratified 5-fold
cross-validation [<xref ref-type="bibr" rid="ref5">5</xref>], which ensures that each training fold reflects
the class distribution of the original dataset, a common strategy in multi-label learning tasks. We
initially conducted a grid search over three learning rates (1e-5, 2e-5, 3e-5) and epoch counts (6, 10, 12)
using the standard AdamW optimizer with weight decay. However, due to memory constraints in our Colab
environment and instability observed with longer training cycles, we subsequently switched to the Paged
AdamW optimizer (32-bit), which supports gradient accumulation and mixed-precision training. Under this
configuration, a slightly higher learning rate (2e-4) combined with a shorter cycle of 3 epochs yielded
better generalization on the validation folds and faster convergence; we validated this change empirically
through multiple runs. We selected the final hyperparameters based on the best macro-averaged F1-score
obtained during cross-validation. Table 3 summarizes the final configuration and training parameters of
the best-performing model; these settings diverge slightly from the initial grid owing to the optimizer
change and runtime constraints. We evaluated model checkpoints using the macro-averaged F1-score across
the three categories, consistent with the official task metric.
        </p>
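        <p>The class weighting described above can be sketched as follows; the exact weighting formula is an assumption (the common negative-to-positive ratio), since the text does not spell it out.</p>
        <preformat>
# pos_weight sketch: per-label ratio of negatives to positives, an
# assumed (but common) formulation for BCEWithLogitsLoss weighting.
import torch
from torch import nn

train_labels = torch.tensor([[0., 0., 0.],
                             [1., 0., 0.],
                             [1., 1., 0.],
                             [0., 0., 1.]])   # toy stand-in for ct-train
pos = train_labels.sum(dim=0)
neg = train_labels.shape[0] - pos
loss_fn = nn.BCEWithLogitsLoss(pos_weight=neg / pos)

logits = torch.randn(4, 3)                    # model outputs for a batch
loss = loss_fn(logits, train_labels)
        </preformat>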
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Threshold Calibration and Ensemble</title>
        <p>We found that the default sigmoid threshold of 0.5 is suboptimal due to class imbalance and skewed
prediction probabilities. To address this, we applied per-class threshold tuning [6] using the
precision_recall_curve function from scikit-learn, selecting the threshold that maximized the F1-score on each
validation fold for each label independently. Finally, we constructed an ensemble [7] by combining the
two best-performing models—one trained with class weighting and one without—using soft-voting over
the predicted probabilities. This approach stabilized predictions and improved generalization on the
test set. We implemented all model components using PyTorch and Hugging Face Transformers.</p>
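        <p>The following sketch illustrates both steps on synthetic validation outputs: per-label threshold selection via precision_recall_curve and soft-voting over two models' probabilities. Variable names are illustrative, not taken from our released code.</p>
        <preformat>
# Threshold tuning and soft-voting sketch on synthetic data.
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_threshold(y_true, probs):
    """Threshold that maximizes F1 on a validation fold for one label."""
    precision, recall, thresholds = precision_recall_curve(y_true, probs)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return thresholds[np.argmax(f1[:-1])]  # last P/R point has no threshold

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=(200, 3))
probs_weighted = rng.random((200, 3))   # model trained with pos_weight
probs_plain = rng.random((200, 3))      # model trained without weights
ensemble = (probs_weighted + probs_plain) / 2   # soft-voting average

thresholds = np.array(
    [best_threshold(y_val[:, c], ensemble[:, c]) for c in range(3)]
)
preds = (ensemble > thresholds).astype(float)
        </preformat>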
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Implementation Details</title>
        <p>All experiments were implemented in Python using Hugging Face’s Transformers and Accelerate
libraries along with PyTorch 2.0. The training was conducted in Google Colab using a single NVIDIA
Tesla T4 GPU with 16 GB of VRAM. Each fold in the 5-fold cross-validation required approximately
35 minutes to complete, including both the training and validation phases. We used mixed-precision
training (fp16) to reduce memory usage and accelerate computation. We executed the full training
pipeline within Linux-based virtual environments, utilizing CUDA 11.8 support.</p>
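        <p>For reference, these settings map onto Hugging Face TrainingArguments roughly as in the sketch below; the batch size and output directory are illustrative, and the paged optimizer is requested through the optim flag (which, in the transformers versions we are aware of, relies on the bitsandbytes package).</p>
        <preformat>
# Training-setup sketch; batch size and output_dir are illustrative.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=16,  # illustrative for a 16 GB T4
    learning_rate=2e-4,              # final setting from Section 3.3
    num_train_epochs=3,              # final setting from Section 3.3
    fp16=True,                       # mixed-precision training
    optim="paged_adamw_32bit",       # Paged AdamW (32-bit)
)
        </preformat>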
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We evaluated the model on the official test set (ct-test.tsv) provided for Task 4a of CheckThat! 2025.
Final predictions were submitted via Codalab and assessed using the task’s primary evaluation metric:
macro-averaged F1-score across the three target categories [6].</p>
      <sec id="sec-4-1">
        <title>4.1. Development Results</title>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Official Evaluation Results</title>
        <p>The final predictions on the test set submitted to the Codalab platform yielded the following macro-F1
scores:
• Overall macro-F1: 0.7983
• Category 1 (C1): 0.8133 (Ranked 1st among all participants)
• Category 2 (C2): 0.7841
• Category 3 (C3): 0.7976</p>
        <p>Our system ranked 2nd overall and achieved the highest score in the most critical category for
scientific verification: scientifically verifiable claims (C1). These results demonstrate the effectiveness
of our multi-stage fine-tuning and ensemble strategy, particularly in distinguishing scientific claims.
All code, model checkpoints, and training scripts used in this study are publicly available at our GitHub
repository [8].</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Comparison to Baseline</title>
        <p>Compared to the official baseline provided by the organizers, which achieved an overall F1 of 0.718,
our approach improved macro-F1 by approximately 8 percentage points (0.718 to 0.7983). Key factors
contributing to this improvement include threshold calibration and ensemble integration.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Analysis of the Results</title>
      <sec id="sec-5-1">
        <title>5.1. Performance per Category</title>
        <p>
          The results obtained in the official evaluation reveal important distinctions in performance across the
three target categories. We achieved the highest F1-score in Category 1 (Scientific Claims), with a system
score of 0.8133. We attribute this strong performance to the model's ability to capture explicit,
verifiable assertions that are often more structurally and semantically aligned with scientific language.
Many of these tweets contain indicative patterns—such as causal statements, statistical language, or
references to studies—that benefit from contextual encoding in transformer architectures [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>In contrast, Category 2 (References) yielded the lowest F1-score (0.7719) among the three. This
category includes both direct and indirect references to scientific sources, which can be subtle, implicit,
or dependent on external knowledge (e.g., recognizing that a link points to a scientific repository) [9].
Tweets in this category often present ambiguity or lack sufficient textual cues to signal a reference
reliably. Since we trained the system solely on tweet text without leveraging metadata or external link
resolution, it may have missed signals necessary to capture this label fully. The inclusion of explicit
features, such as DOIs, paper titles, or domain-specific cues, could improve classification precision in
future iterations.</p>
        <p>Category 3 (Research Context), with an F1-score of 0.8098, reflects a middle ground in difficulty.
Tweets labeled under this class often refer to researchers, institutions, or ongoing studies without
making verifiable claims or citing specific resources. The model's performance suggests it effectively
learned to identify institutional or contextual phrases (e.g., "new study from," "Harvard researchers,"
"ongoing trials"), which likely served as discriminative features.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Effect of Calibration</title>
        <p>The inclusion of per-label threshold tuning, using precision-recall curves, had a notable positive effect,
particularly in balancing recall and precision for the less-represented categories. Similarly, class-specific
weighting via the pos_weight parameter helped reduce the bias toward the dominant no-label class ([0,
0, 0]), ensuring the model remained sensitive to minority classes.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Observed Limitations</title>
        <p>While the system demonstrated overall robustness, several limitations persist. Tweets exhibiting sarcasm,
fragmented phrasing, or lacking explicit linguistic cues proved especially difficult to classify reliably. In
particular, short tweets that contained only links, emojis, or hashtags often lacked sufficient contextual
signals to support confident categorization. These were especially problematic in distinguishing between
Category 2 (Reference) and Category 3 (Context), as both may mention research implicitly but differ in
intent.</p>
        <p>Furthermore, Category 2 presented persistent ambiguity, as references to scientific knowledge were
often made implicitly through hyperlinks or vague mentions (e.g., "see this" or "the study shows")
without explicit citations or domain indicators. Since our system relied solely on tweet text and excluded
metadata or link resolution, it struggled to infer whether such tweets truly referenced scientific sources.
For example, the lack of URL domain analysis (e.g., arxiv.org, pubmed.ncbi.nlm.nih.gov) limited the
model’s ability to detect indirect references.</p>
        <p>Additionally, our approach did not incorporate conversational or user-level features—such as quote
tweets, thread continuity, or reply context—which could clarify meaning in multi-tweet exchanges. As
a result, label predictions occasionally failed to capture implicit discourse continuity. Future iterations
may benefit from integrating URL resolution, external knowledge bases, or discourse-aware modeling
strategies to improve robustness in real-world social media scenarios.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Perspectives for Future Work</title>
      <p>The competitive performance achieved in Task 4a highlights clear directions for enhancing the system’s
capabilities. While the current approach relies solely on tweet text and supervised fine-tuning, future
iterations can integrate architectural, contextual, and multilingual improvements. Below, we outline
four specific enhancements, their rationale, and the implementation paths.</p>
      <p>1. Integration of Resolved URLs and Domain Signals. Many tweets in Category 2 implicitly
reference scientific content through shortened links. To address this, we plan to incorporate a URL
resolution pipeline using tools such as newspaper3k or tldextract, which enables the extraction of
domain names (e.g., pubmed.ncbi.nlm.nih.gov, arxiv.org) and page metadata. We can encode
these features as auxiliary embeddings or categorical indicators within the model. Implementation will
require API access and preprocessing routines for link expansion.</p>
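      <p>A minimal sketch of the planned domain-signal extraction is shown below, assuming tldextract's public API; link expansion (e.g., resolving t.co redirects) would precede this step, and the scientific-domain whitelist is illustrative.</p>
      <preformat>
# Domain extraction sketch; the scientific-domain set is illustrative.
import tldextract

resolved = "https://pubmed.ncbi.nlm.nih.gov/12345678/"  # after expansion
parts = tldextract.extract(resolved)
fqdn = ".".join(p for p in (parts.subdomain, parts.domain, parts.suffix) if p)
is_scientific = fqdn in {"pubmed.ncbi.nlm.nih.gov", "arxiv.org", "doi.org"}
print(fqdn, is_scientific)
      </preformat>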
      <p>2. Discourse-Aware and Contextual Modeling. The current model treats each tweet
independently, which limits its ability to capture implicit context. Future versions will utilize discourse
structures, such as tweet threads, quote-retweets, and user reply chains, via the Twitter API. We aim to
explore hierarchical models (e.g., HierBERT) and contrastive learning approaches to encode inter-tweet
dependencies better.</p>
      <p>3. Multilingual Adaptation. Given the global nature of science communication, extending the
classifier to support additional languages is a priority. We plan to adapt multilingual transformer
models such as XLM-R and mDeBERTa, with a focus on Spanish and Portuguese. This will involve
collecting a parallel annotated corpus following the SciTweets schema, possibly through partnerships
with fact-checking organizations in Latin America or through crowdsourcing platforms.</p>
      <p>4. Explainability and Transparency. To foster model interpretability, we intend to integrate
explainability tools such as BertViz, exBERT, SHAP, or Integrated Gradients. These tools can visualize
attention weights or saliency scores, helping users understand model decisions—particularly in
ambiguous or borderline cases. We release all code, model checkpoints, and training logs as open source via
GitHub and track them through Weights &amp; Biases. These efforts aim not only to enhance
classification performance but also to support reproducibility, trust, and community collaboration in scientific
discourse detection.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
      <p>We would like to thank the organizers of the CheckThat! Lab at CLEF 2025 for providing a well-structured
and valuable evaluation framework. This work was supported in part by the Universidad Tecnológica
de Bolívar and its School of Digital Transformation. We also acknowledge the collaborative efforts of
the VerbaNexAI team for their dedication to advancing scientific discourse detection and reproducibility
in social media NLP research. We extend special thanks to the Colombian Navy, through its Naval
Education Command (Jefatura de Educación Naval) and the Naval Technological Development Center
(Centro de Desarrollo Tecnológico Naval), for their financial support and institutional endorsement of
this research. We will make the system and code described in this paper publicly available to promote
reproducibility and community collaboration.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT-4 and Grammarly for grammar and style
revision. All outputs were carefully reviewed and edited by the authors, who take full responsibility for
the final content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hafid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Korre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Muti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schellhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          ,
          <string-name>
            <surname>V. V.</surname>
          </string-name>
          ,
          <article-title>The clef-2025 checkthat! lab: Subjectivity, fact-checking, claim normalization, and retrieval</article-title>
          , in: C.
          <string-name>
            <surname>Hauff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Macdonald</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Jannach</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Kazai</surname>
            ,
            <given-names>F. M.</given-names>
          </string-name>
          <string-name>
            <surname>Nardini</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Pinelli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Silvestri</surname>
          </string-name>
          , N. Tonellotto (Eds.),
          <source>Advances in Information Retrieval. ECIR 2025. Lecture Notes in Computer Science</source>
          , volume
          <volume>15576</volume>
          , Springer Nature Switzerland, Cham,
          <year>2025</year>
          , pp.
          <fpage>467</fpage>
          -
          <lpage>478</lpage>
          . doi:10.1007/978-3-031-88720-8_68.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hafid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kartal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schellhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Boland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bringay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2025 CheckThat! lab task 4 on scientific web discourse</article-title>
          ,
          <source>in: Working Notes of CLEF</source>
          <year>2025</year>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , W. Chen,
          <article-title>DeBERTa: Decoding-enhanced BERT with disentangled attention</article-title>
          ,
          <source>in: Proceedings of the 9th International Conference on Learning Representations (ICLR)</source>
          ,
          <source>OpenReview.net</source>
          ,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=XPZIaotutsD.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.-H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Sabuncu</surname>
          </string-name>
          ,
          <article-title>Generalized cross entropy loss for training deep neural networks with noisy labels</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          <volume>31</volume>
          (
          <issue>NeurIPS</issue>
          ), Curran Associates, Inc.,
          <year>2018</year>
          , pp.
          <fpage>8792</fpage>
          -
          <lpage>8802</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <article-title>Scibert: A pretrained language model for scientific text</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , Association for Computational Linguistics,
          <year>2019</year>
          , pp.
          <fpage>3615</fpage>
          -
          <lpage>3620</lpage>
          . doi:10.18653/v1/D19-1371.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] T. Saito, M. Rehmsmeier, The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets, PLOS ONE 10 (2015) e0118432. doi:10.1371/journal.pone.0118432.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, in: Advances in Neural Information Processing Systems 32 (NeurIPS), Curran Associates, Inc., 2019, pp. 5754–5764. URL: https://proceedings.neurips.cc/paper_files/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] M. J. Sosa-Borrero, J. E. Serrano, J. C. Martinez-Santos, E. Puertas, VerbaNexAI Lab at CheckThat! 2025: Scientific discourse detection in tweets – code repository, https://github.com/VerbaNexAI/CLEF2025/tree/main/CheckThat, 2025. Accessed: 2025-05-29.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] S. Hafid, S. Schellhammer, S. Bringay, K. Todorov, S. Dietze, SciTweets: A dataset and annotation framework for detecting scientific online discourse, in: Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management (CIKM), ACM, 2022, pp. 3988–3992. doi:10.1145/3511808.3557516.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>