<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Discourse⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arpan Majumdar</string-name>
          <email>arpanmajumdar952@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dipankar Das</string-name>
          <email>dipankar.dipnil2005@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pritam Pal</string-name>
          <email>pritampal522@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Jadavpur University</institution>
          ,
          <addr-line>West Bengal</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Kalyani</institution>
          ,
          <addr-line>West Bengal</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>With the abundance of scientific information and misinformation propagated on social media, especially on Twitter, there is an urgent need for automated systems to classify science-related information. Task 4a of the CLEF-2025 CheckThat! Lab [1] aims to detect three types of scientific discourse in tweets: scientific claims, mentions of scientific studies, and references to scientific entities. In this paper, we present a hybrid transformer-based solution that combines SciBERT, a model pretrained on scientific text, and Twitter-RoBERTa, a model tuned specifically for tweets. By taking advantage of the strengths of both models, we capture the formal, technical aspects of scientific language as well as the informal, noisy conventions of social media discourse. The pooled embeddings from both encoders are concatenated and fed through a multilayer classification head that uses dropout for regularization and a sigmoid activation for multilabel prediction. The model was trained with binary cross-entropy loss, early stopping, and adaptive learning rate scheduling. It achieved a macro-averaged F1 score of 0.8262 on the development set, with a minimum validation loss of 0.1744. These results provide evidence for the benefit of hybrid pretraining for scientific tweet classification and lay a foundation for future extensions, which could incorporate relationships between labels and multilingual aspects.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>As scientific communication continues to shift to social media, the ability to automatically
find and analyze science-related content has become a necessary tool for researchers, journalists, and
fact-checkers (referred to simply as "users" in this task). Twitter hosts a unique mix of public
conversations involving scientific claims, informal commentary, news, and misinformation, creating
a complex environment. Recognizing the demand for tools that detect particular forms of discourse,
the CheckThat! Lab at CLEF-2025 introduced Task 4a - an attempt to detect three forms of scientific
web discourse in individual tweets. The three forms consist of scientific claims,
references to scientific studies, and mentions of scientific entities such as researchers and institutions.</p>
      <p>
        The interplay between two very different registers of language - the formal, precise language
of scientific publications and the informal, context-dependent language of tweets - makes this a
difficult task for classification models, since the variety within each register exceeds what
traditional text classification models handle well. In this experiment, we therefore introduce a hybrid
solution that exploits the complementary strengths of two pretrained transformer models - SciBERT
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], pretrained on a large collection of scientific publications, and Twitter-RoBERTa [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], fine-tuned for
social media text.
      </p>
      <p>We expect that an appropriate combination of these two embeddings will allow the model to
handle both the conventions of scientific claims and the distinctive language of Twitter.
We propose a simple and effective architecture that obtains the pooled embeddings from each encoder
and concatenates them as input to a classification head for predicting the multilabel outputs.</p>
      <p>The system is trained end-to-end using binary cross-entropy loss and regularized with early stopping
based on validation performance. This manuscript describes the dataset, the model
architecture, the training strategy, the evaluation results, and qualitative observations, and concludes
with a discussion of limitations and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset Description</title>
      <p>The data used for CLEF-2025 Task 4a is an annotated version of the SciTweets corpus, in which tweets
have been labeled for scientific discourse. The data is provided in three subsets: a training set of
1,229 labeled tweets, a development set (137 tweets), and a test set (240 tweets). Each tweet
carries three separate binary labels corresponding to three independent subtasks: whether the tweet
contains a scientific claim, whether it refers to a scientific study or publication, and whether it
mentions a scientific entity (such as a university or scientist).</p>
      <p>
        Because this is a multilabel dataset, any given tweet may carry none, one, or several of the labels.
The training set has a fairly balanced distribution of the three labels, so the model cannot
simply exploit label frequencies with counting-based shortcuts. The annotation
relies on the SciTweets framework and provides clear instructions to labelers to achieve
consistent labeling across examples. The dataset therefore presents a realistic and challenging
setting in which multilabel classification models [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] can be developed to navigate nuanced
scientific discourse on social media.
      </p>
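      <p>As a concrete illustration, the three binary labels per tweet can be collapsed into a target vector for multilabel training. The following minimal Python sketch uses hypothetical field names; the official task files may name the columns differently:</p>
      <preformat>
```python
# Hypothetical sketch: the label keys below are illustrative, not the
# official column names of the Task 4a data files.
LABELS = ["scientific_claim", "study_reference", "scientific_entity"]

def to_target_vector(annotation: dict) -> list:
    """Map a tweet's annotation dict to a 0/1 vector, one slot per label."""
    return [1 if annotation.get(label) else 0 for label in LABELS]

tweet = {"scientific_claim": True, "study_reference": False,
         "scientific_entity": True}
print(to_target_vector(tweet))  # -> [1, 0, 1]
```
      </preformat>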
    </sec>
    <sec id="sec-3">
      <title>3. Related Work</title>
      <p>
        Recent advances in transformer models have significantly improved scientific text analysis and discourse
classification. SciBERT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a BERT variant pretrained on scientific publications, demonstrates superior
performance in capturing scientific language nuances for downstream scholarly NLP tasks. This success
extends to other domain-specific models like BioBERT and LegalBERT, establishing transformer-based
architectures with specialized pretraining as the standard approach in scientific NLP.
      </p>
      <p>
        For social media discourse analysis, specialized models have emerged to address platform-specific
challenges. BERTweet [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] targets Twitter-specific language patterns, while the TweetEval
benchmark consolidates seven heterogeneous tweet classification tasks, reporting strong performance from
transformer models further pretrained on Twitter corpora. The SciTweets dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] specifically
addresses scientific discourse on Twitter, introducing annotation guidelines for science-related tweets
and achieving approximately 89% F1-score in distinguishing scientific content using domain-specific
features.
      </p>
      <p>
        Recent shared tasks provide additional context for scientific discourse detection across platforms.
The CLEF CheckThat! lab [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] featured tasks on identifying check-worthy scientific claims in tweets and
news articles. Participants predominantly employed transformer architectures including multilingual
RoBERTa, XLM-RoBERTa, and GPT variants, consistently outperforming heuristic baselines. These
results reinforce that pretrained transformers constitute robust foundations for scientific discourse
classification tasks.
      </p>
      <p>
        Our approach to model efficiency draws inspiration from O’Neill et al.’s layer fusion technique [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
which identifies and merges similar layers in pretrained networks to achieve structured compression.
Their experiments demonstrated that transformers could be reduced to approximately 20% of their
original size while maintaining performance, with only minimal perplexity increases. We adapt this
methodology by aligning and fusing analogous transformer layers, preserving inter-layer information
while creating more compact models. Although fusion-based compression remains underexplored
in discourse classification, it offers a principled framework for deploying large transformer models
efficiently in scientific discourse detection tasks.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>
        To address scientific discourse classification on Twitter pragmatically, we
propose a hybrid transformer-based architecture that exploits both domain-specific and social-contextual
language understanding. The framework leverages two pretrained models -
SciBERT and Twitter-RoBERTa - combined via a late fusion layer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], followed by a
deeper classification head for multi-label prediction.
      </p>
      <p>At the core of our proposal lies a dual-encoder structure. We first preprocess the tweets using
typical NLP steps such as tokenization, lowercasing, and truncation to create the input required
by the transformer models. Each input tweet yields two representations, as it passes
through the two encoder models separately. SciBERT is trained on scientific literature and therefore
offers coherent representations of domain-specific vocabulary and formal discourse. In parallel,
Twitter-RoBERTa, pretrained on a large volume of tweets, captures the informal, noisy, user-generated
text characteristic of social media.</p>
      <p>From both models, we obtain the contextualized [CLS] token embedding which is a compressed
semantic summary of the input text. We then concatenate these two 768-dimensional vectors into
a 1536-dimensional combined representation. This vector serves as the fusion layer combining the
benefits of the two encoders to account for the domain relevance and social context.</p>
      <p>We then pass the fused representation through a classification head consisting of two fully
connected (FC) layers separated by dropout layers to mitigate overfitting. The first FC layer reduces the
dimension from 1536 to 512 and is followed by a ReLU activation and a dropout layer (p
= 0.3). The second FC layer reduces the representation to 128 dimensions, again followed by ReLU and
dropout. The output layer applies a sigmoid activation to produce three
probabilities, one for each label: scientific claim, study reference, and scientific entity. The sigmoid allows for
multi-label outputs, a necessity in this task, where a tweet can fall into multiple categories.</p>
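      <p>The fusion-and-head portion of this architecture can be sketched in PyTorch as follows. This is a sketch, not our exact implementation: the two encoders are stand-ins (any modules producing 768-dimensional pooled [CLS] outputs can be plugged in), and the layer sizes follow the text (1536 to 512 to 128 to 3, dropout p = 0.3):</p>
      <preformat>
```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Late-fusion head over concatenated SciBERT and Twitter-RoBERTa
    [CLS] embeddings (768 + 768 = 1536). Sketch only: the encoders
    themselves are omitted."""

    def __init__(self, enc_dim: int = 768, p_drop: float = 0.3,
                 n_labels: int = 3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * enc_dim, 512), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(512, 128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, n_labels), nn.Sigmoid(),  # per-label probabilities
        )

    def forward(self, sci_cls: torch.Tensor,
                tw_cls: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([sci_cls, tw_cls], dim=-1)  # late fusion by concatenation
        return self.head(fused)

model = FusionClassifier()
probs = model(torch.randn(4, 768), torch.randn(4, 768))  # batch of 4 tweets
print(probs.shape)  # torch.Size([4, 3])
```
      </preformat>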
      <p>The model is trained with binary cross-entropy loss and early stopping based on the validation
macro F1-score, ensuring convergence and generalization to unseen data.</p>
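      <p>The early-stopping criterion keyed to the validation macro F1-score can be sketched as a small tracker; the patience value below is an assumption for illustration, not the value used in our experiments:</p>
      <preformat>
```python
class EarlyStopping:
    """Stop training when validation macro-F1 has not improved for
    `patience` consecutive epochs. A minimal sketch, not the exact
    training code used in the experiments."""

    def __init__(self, patience: int = 2):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_f1: float) -> bool:
        """Record one epoch's validation score; return True to stop."""
        if val_f1 > self.best:
            self.best = val_f1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
for epoch, f1 in enumerate([0.70, 0.78, 0.80, 0.79, 0.79]):
    if stopper.step(f1):
        print(f"stopping after epoch {epoch}")  # -> stopping after epoch 4
        break
```
      </preformat>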
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Our model trained cleanly and performed strongly on the development set. The best model (selected by
early stopping) had a validation loss of 0.1744 and a macro-averaged F1 score of 0.8262. The validation
loss and F1 score on the development set convey the model’s generalization across all three label
categories, without overfitting. Although the loss curve figure could not be included due to
submission constraints, we observed a clear downward trajectory in both training and validation losses
across epochs. The model began converging steadily after the third epoch, and early stopping was
triggered around the sixth epoch when the validation loss plateaued. The close correlation between
training and validation losses indicates that the model learned meaningful features without overfitting.</p>
      <p>Although our model performed well overall, the category-wise results reveal some variation in
label-specific performance. For instance, on the development set, the model achieved F1-scores of
0.7451, 0.8302, and 0.9014 for Category 1 (Scientific Claim), Category 2 (Study Reference), and Category
3 (Scientific Entity), respectively. This indicates stronger performance in identifying entities compared
to claims. Similarly, the test set shows a drop in Category 2 (Study Reference) with an F1 of 0.5965,
highlighting this as a relatively harder category. These results suggest label-specific challenges that
future work could address through additional training signals or class balancing strategies.</p>
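      <p>For reference, the macro-averaged F1 reported above is the unweighted mean of the three per-category F1 scores. A minimal sketch of that computation on 0/1 label vectors (the data below is synthetic, not our actual predictions):</p>
      <preformat>
```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Binary F1 from confusion counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(y_true, y_pred):
    """Per-label F1 averaged over the labels (macro average)."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(t[j] and p[j] for t, p in zip(y_true, y_pred))
        fp = sum((not t[j]) and p[j] for t, p in zip(y_true, y_pred))
        fn = sum(t[j] and (not p[j]) for t, p in zip(y_true, y_pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / n_labels

# Synthetic example with three labels per tweet.
y_true = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 0], [0, 1, 0]]
print(round(macro_f1(y_true, y_pred), 4))  # -> 0.7778
```
      </preformat>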
    </sec>
    <sec id="sec-6">
      <title>6. Observation</title>
      <p>During training, we observed stable convergence: both the training and validation losses decreased
steadily, indicating genuine learning. The validation loss plateaued midway through training,
approximately after the sixth epoch, at which point early stopping was triggered. Early stopping
ensured that the model did not overfit the training data.</p>
      <p>[Figure: model architecture - text preprocessing, parallel tokenization, SciBERT and
Twitter-RoBERTa encoders, embedding fusion, two fully connected layers each with ReLU activation
and dropout, and a sigmoid output layer producing the three-label multi-label classification output.]</p>
      <p>It was also encouraging that the model’s macro-F1 score remained above 0.80 during the
early epochs, with a final score of 0.8262. This underscores the value of concatenating the
embeddings of SciBERT and Twitter-RoBERTa, which provided the model with sufficient features
to learn language patterns across a much broader feature space.</p>
      <p>While experimenting with the classification head, we also varied its structure, testing
different dropout probabilities and numbers of units in the fully connected layers. We found
that two fully connected layers (512 intermediate units) and a dropout probability of 0.5
provided the best trade-off between capacity and generalization.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This research introduced a hybrid model to detect scientific discourse in tweets, designed specifically
for the challenges of CLEF-2025 Task 4a. Our dual-encoder model successfully integrates the domain
knowledge of SciBERT, pretrained on scientific texts, with the social-language
knowledge of Twitter-RoBERTa to classify tweets into three distinct scientific discourse categories. The metrics -
especially the macro-F1 of 0.8262 and the validation loss of 0.1744 - indicate that this dual-encoder approach is a
promising direction for future scientific content classification work on social media.</p>
      <p>The primary value of our model is its ability to bridge two distinct language domains: the formal
language of science and the informal language of social media. As a result, it suits this particular
task, and it could potentially be applied to other challenges involving domain adaptation or
mixed-register text classification.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Limitations and Future Work</title>
      <p>Although the performance of our approach is promising, we recognize multiple limitations. First,
the model treats each of the three labels as independent. In practice, however, these categories are
often inter-dependent; for example, a scientific claim often comes with a reference to a study. To
improve performance, it would be beneficial to model label inter-dependencies, for example via
conditional random fields or graph-based methods.</p>
      <p>Second, the model is currently trained only on English tweets. As scientific discussions become
increasingly multilingual, the approach could be extended with multilingual or cross-lingual
transformer models such as XLM-RoBERTa, making the task more language-agnostic.</p>
      <p>A related improvement would be to use larger model variants such as SciBERT-large or
Twitter-RoBERTa-large, if compute were not a constraint and the larger capacity yielded better
performance. Lastly, the model could benefit from training with additional auxiliary
information - such as tweet metadata, timestamps, or the contents of linked articles -
that may provide further context to improve classification.</p>
      <p>In light of these limitations, this paper provides a framework and a clear starting point for
continued research in scientific discourse detection on social media.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used generative AI tools including ChatGPT, Claude,
and Grammarly to support writing tasks such as grammar correction, sentence rephrasing, and
improving clarity. Specifically, these tools were used in drafting the Abstract, Related Work, and
Conclusion sections. All AI-generated content was carefully reviewed and edited by the authors. No AI
tool was listed as an author, and the authors take full responsibility for the accuracy and integrity of
the final manuscript.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hafid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Kartal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schellhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Boland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bringay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2025 CheckThat! lab task 4 on scientific web discourse</article-title>
          , ????
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <article-title>Scibert: A pretrained language model for scientific text</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1903.10676. arXiv:1903.10676.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Espinosa</given-names>
            <surname>Anke</surname>
          </string-name>
          , L. Neves,
          <article-title>TweetEval: Unified benchmark and comparative evaluation for tweet classification</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2020</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>1644</fpage>
          -
          <lpage>1650</lpage>
          . URL: https://aclanthology.org/2020.findings-emnlp.148. doi:10.18653/v1/2020.findings-emnlp.148.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bogatinovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Todorovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Džeroski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kocev</surname>
          </string-name>
          ,
          <article-title>Comprehensive comparative study of multilabel classification methods</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>203</volume>
          (
          <year>2022</year>
          )
          <fpage>117215</fpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0957417422005991. doi:10.1016/j.eswa.2022.117215.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <article-title>SciBERT: A Pretrained Language Model for Scientific Text</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3615</fpage>
          -
          <lpage>3620</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. Q.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <article-title>Bertweet: A pre-trained language model for english tweets</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Hafid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Alghamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alshehri</surname>
          </string-name>
          ,
          <article-title>Scitweets: A dataset and classifier for detecting scientific discourse on twitter</article-title>
          ,
          <source>arXiv preprint arXiv:2206.07360</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>O'Neill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Delany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>MacNamee</surname>
          </string-name>
          ,
          <article-title>Model compression for bert using layer fusion</article-title>
          , arXiv preprint arXiv:2012.13136 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. O.</given-names>
            <surname>Neill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Steeg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Galstyan</surname>
          </string-name>
          ,
          <source>Compressing deep neural networks via layer fusion</source>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/2007.14917. arXiv:2007.14917.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>