<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Social Media Misinformation Detection Model Integrating Semantic and Twitter Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Junwei Peng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zijie Lin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhongyuan Han</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Guangdong</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>The proliferation of misinformation on social media platforms has become a critical challenge in the digital age. This study presents a hybrid deep learning approach for detecting misinformation by combining ModernBERT with comprehensive feature engineering. We participated in the FIRE-2025 Task 3 shared task on misinformation detection. Our methodology integrates transformer-based language understanding with hand-crafted features extracted from text, user profiles, and social engagement patterns. To address the severe class imbalance problem, we employ Focal Loss with strategic resampling techniques. The experimental results demonstrate that our hybrid model achieves weighted F1-scores of 0.97 and 0.82 on the two official test datasets, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Misinformation Detection</kwd>
        <kwd>Semantic Feature</kwd>
        <kwd>Social Media Feature</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Early approaches to misinformation detection primarily relied on hand-crafted features extracted from
news content combined with traditional machine learning classifiers. These methods exploited the
hypothesis that deceptive content exhibits distinctive patterns in writing style, enabling automatic
detection through statistical analysis. Zhou and Zafarani [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] concluded that, within traditional machine
learning frameworks, hand-crafted features for detecting fake news are typically extracted from four
linguistic levels of text: lexical, syntactic, semantic, and discourse. Building upon these feature categories,
representative early work includes that of Feng et al., who proposed a syntactic stylometry approach using CFG
parse-tree features for deception detection [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and that of Pérez-Rosas et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], who integrated
lexical, syntactic, and semantic features with SVM classifiers for fake news identification across multiple
datasets. These traditional approaches established important foundations by revealing linguistic patterns
distinguishing deceptive content.
      </p>
      <p>The advent of deep learning opened new research directions for misinformation detection. Ajao et
al. [6] proposed hybrid CNN-RNN architectures that combine convolutional layers for local pattern
extraction with recurrent layers for sequential modeling, enabling the capture of both spatial and
temporal dynamics of deceptive language on Twitter. Additionally, Devlin et al. [7] proposed BERT,
which leverages masked language modeling to learn contextual representations from large-scale corpora.
BERT’s bidirectional attention mechanism enables it to capture nuanced semantic relationships crucial
for distinguishing subtle differences between genuine and misleading content. Building upon BERT’s
foundation, subsequent variants such as RoBERTa [8] optimized pre-training procedures with dynamic
masking and larger batch sizes, and DeBERTa [9] introduced disentangled attention to separately
model content and position information in the attention mechanism. These advances have substantially
improved detection performance by better capturing semantic features and contextual nuances in social
media discourse.</p>
      <p>Beyond content analysis, research has demonstrated that social context features provide
complementary signals for misinformation detection. Castillo et al. [10] integrated user-based features alongside
content, demonstrating that user-level characteristics such as account age and follower count serve as
valuable indicators of source credibility on Twitter.</p>
      <p>Recognizing the value of these diverse features for social media misinformation detection, we propose
a hybrid model that integrates them to address the FIRE 2025 detection task.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task and Dataset Description</title>
      <sec id="sec-3-1">
        <title>3.1. Task Definition</title>
        <p>
          The FIRE-2025 Task 3 [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ] classifies Twitter posts into two categories: (i) Misinformation - posts
containing false, misleading, or unverified information, and (ii) Non-misinformation - posts containing
legitimate, accurate information.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset Description</title>
        <p>
          The dataset provided by the organizers [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ] is derived from Twitter posts about the Russo-Ukrainian
conflict [11], collected using the AMUSED annotation framework [12]. Each data instance contains
the following information: (i) Text - the main content of the social media post, (ii) User metadata
including follower count and friends count, and (iii) Engagement metrics - retweet count and favorite
count.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Dataset Statistics</title>
        <p>As shown in Table 1, the dataset exhibits severe class imbalance, with misinformation representing
approximately 1.054% of both the training and validation data.</p>
        <p>[Table 1: class distribution of the dataset, with columns Dataset, Misinfo, and Non-misinfo, and rows for the Train and Val splits; the counts were not recoverable from the source.]</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Evaluation Metrics</title>
        <p>The organizers provide the following metrics for evaluation: (i) Precision - the proportion of correct
misinformation predictions among all misinformation predictions, (ii) Recall - the proportion of actual
misinformation cases correctly identified, and (iii) Weighted F1-score - the F1-score weighted by class
support, providing overall model performance.</p>
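        <p>As a concrete illustration, the three metrics can be computed with scikit-learn on a toy prediction vector; whether the organizers' official scorer uses scikit-learn is an assumption of this sketch.</p>
        <preformat>
```python
# Sketch of the evaluation metrics from Section 3.4 using scikit-learn.
# (Assumption: the official scorer may differ in implementation details.)
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 1, 1]   # 1 = misinformation, 0 = non-misinformation
y_pred = [0, 0, 1, 1, 0]

# (i) proportion of correct misinformation predictions among all misinformation predictions
precision = precision_score(y_true, y_pred)
# (ii) proportion of actual misinformation cases correctly identified
recall = recall_score(y_true, y_pred)
# (iii) F1-score weighted by class support
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(precision, recall, weighted_f1)
```
        </preformat>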
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>Figure 1: Architecture of the hybrid model. Twitter data is tokenized and encoded by a 12-layer ModernBERT to produce semantic features; social media features pass through a feature-embedding network (Linear 128, ReLU, Linear 64, ReLU); the two representations are fused by concatenation and fed through Linear 256, ReLU, and Linear 2 for classification.</p>
      <sec id="sec-4-1">
        <title>4.1. Model Architecture</title>
        <p>Our hybrid model consists of three main components as shown in Figure 1:</p>
        <p>1. Semantic Features (blue blocks in Figure 1): We use ModernBERT-base as the backbone for
encoding textual content. The encoder consists of 12 transformer layers, each containing multi-head
self-attention and a feed-forward network with residual connections and layer normalization. Input
texts are first processed by the tokenizer with a maximum length of 256 tokens, then encoded through
ModernBERT. We extract the [CLS] token representation from the final layer as a 768-dimensional text
embedding. During training, we fine-tune the ModernBERT weights end-to-end along with the feature
fusion layers.</p>
        <p>2. Social Media Features (green blocks in Figure 1): We extract 24 hand-crafted features from the
datasets and categorize them into three groups: (i) Text-based features (12) capturing statistical and
stylistic properties including text length, word count, exclamation count, question count, ellipsis count,
uppercase/digit ratios, URL/mention/hashtag counts, average word length, and emotion word count; (ii)
User-based features (7) characterizing the content publisher including verification status, follower
count, friends count, follower-to-friends ratio, status count, account age, and profile description length;
(iii) Social engagement features (5) reflecting post reception including retweet count, favorite count,
retweet status, total engagement, and retweet ratio. All features are standardized using StandardScaler
for zero mean and unit variance. These features are then processed through a feed-forward network: a
128-dimensional linear layer with ReLU activation and Dropout (0.3), followed by a 64-dimensional
linear layer with ReLU activation and Dropout (0.2), producing the final feature representation.</p>
        <p>3. Fusion Layer (Final Classification): We concatenate the 768-dimensional BERT embeddings
and 64-dimensional feature embeddings using a Concat layer, resulting in an 832-dimensional fused
representation. This concatenated representation is processed through: a 256-dimensional linear layer
with ReLU activation and Dropout (0.3), followed by a 2-dimensional linear layer, outputting logits for
binary classification.</p>
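        <p>The fusion architecture described above can be sketched in PyTorch as follows. This is a minimal illustrative sketch: the ModernBERT encoder is abstracted away (the forward pass takes a precomputed 768-dimensional [CLS] embedding), and the class name and exact Dropout placement are assumptions rather than the authors' released code.</p>
        <preformat>
```python
import torch
import torch.nn as nn

class HybridMisinfoClassifier(nn.Module):
    """Illustrative sketch of the fusion model in Section 4.1.

    The real model fine-tunes ModernBERT-base end-to-end; here the encoder
    is left abstract and the forward pass receives the 768-d [CLS] embedding
    directly, so only the feature network and fusion head are shown.
    """

    def __init__(self, text_dim=768, feat_dim=24):
        super().__init__()
        # feed-forward network over the 24 hand-crafted features
        self.feat_net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.2),
        )
        # fusion head over the concatenated 768 + 64 = 832-d representation
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + 64, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 2),
        )

    def forward(self, cls_embedding, features):
        feat_emb = self.feat_net(features)                     # (batch, 64)
        fused = torch.cat([cls_embedding, feat_emb], dim=-1)   # (batch, 832)
        return self.classifier(fused)                          # logits, (batch, 2)
```
        </preformat>
        <p>Instantiating the module and passing a batch of embeddings and standardized feature vectors yields one logit pair per post for binary classification.</p>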
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Focal Loss Function</title>
        <p>To handle class imbalance during training, we employ Focal Loss [13]:</p>
        <p>FL(p_t) = −α_t (1 − p_t)^γ log(p_t)  (1)</p>
        <p>where p_t is the model’s estimated probability for the true class, α_t is a class-dependent weighting
factor, and γ is the focusing parameter. We set α = 0.97 and γ = 2.5 based on preliminary experiments.</p>
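        <p>Focal Loss can be implemented on top of the standard cross-entropy in a few lines; the following is a minimal sketch, where the assignment of α to the positive (misinformation) class and 1 − α to the negative class is an assumption about the paper's exact weighting scheme.</p>
        <preformat>
```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.97, gamma=2.5):
    """Sketch of FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    Assumption: alpha weights the positive (misinformation) class and
    1 - alpha the negative class.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")  # ce = -log(p_t)
    p_t = torch.exp(-ce)                                     # probability of the true class
    t = targets.float()
    alpha_t = alpha * t + (1.0 - alpha) * (1.0 - t)          # class-dependent weight
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```
        </preformat>
        <p>With γ = 0 and α = 0.5 the expression reduces to a uniformly halved cross-entropy; larger γ progressively down-weights well-classified examples so training focuses on the hard minority cases.</p>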
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Threshold Optimization</title>
        <p>After model training, we optimize the threshold on the validation set. Instead of using the default
threshold of 0.5, we perform a systematic grid search over thresholds ranging from 0.01 to 0.99 in
increments of 0.01, and select the candidate threshold that maximizes the F1-score on the validation set.</p>
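        <p>The grid search can be sketched as follows; selecting by F1 on the misinformation class is an assumption about the exact selection criterion, consistent with the F1 improvement reported in Section 5.4.</p>
        <preformat>
```python
import numpy as np
from sklearn.metrics import f1_score

def search_threshold(probs, y_true):
    """Grid search over thresholds 0.01..0.99 in increments of 0.01 (Section 4.3).

    Assumption: candidates are ranked by F1-score on the positive class.
    """
    best_t, best_f1 = 0.5, -1.0
    for t in np.arange(0.01, 1.0, 0.01):
        preds = (np.asarray(probs) >= t).astype(int)   # apply candidate threshold
        score = f1_score(y_true, preds, zero_division=0)
        if score > best_f1:
            best_t, best_f1 = float(t), float(score)
    return best_t, best_f1
```
        </preformat>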
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiment</title>
      <p>The organizers released two test sets with different feature availability. The first test set includes
comprehensive features, while the second test set only includes text content. To accommodate these
differences, we trained two separate models: the first model combines all social media features with
semantic features, while the second model combines only text-based features with ModernBERT’s
semantic features. Additionally, compared to the first model, the second model uses 13 optimized
text-based features, replacing emotion word count with two new features: multiple exclamation count
and all-caps word count. Moreover, the second model employs a smaller feature extraction network
with dimensions of 64 and 32, compared to the first model’s 128 and 64 dimensions. Consequently, the
fusion layer of the second model concatenates 768-dimensional BERT embeddings with 32-dimensional
feature embeddings, resulting in an 800-dimensional representation instead of the 832 dimensions used
in the first model.</p>
      <sec id="sec-5-1">
        <title>5.1. Experimental Setup</title>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Data Preprocessing</title>
        <sec id="sec-5-2-1">
          <title>5.2.1. Text Cleaning</title>
          <p>Raw social media text contains noise that can hinder model performance. Our text cleaning pipeline
includes: (i) removal of URLs and web links, (ii) removal of special characters while preserving linguistic
characters, (iii) normalization of whitespace, and (iv) conversion to lowercase. For missing or empty
text fields, we use a placeholder “empty_text” to maintain data structure integrity.</p>
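          <p>The four cleaning steps can be sketched as a small function; the exact set of characters preserved in step (ii) is an assumption of this sketch.</p>
          <preformat>
```python
import re

def clean_text(text):
    """Text cleaning pipeline from Section 5.2.1 (illustrative sketch)."""
    if text is None or not text.strip():
        return "empty_text"                              # placeholder for missing/empty text
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # (i) remove URLs and web links
    text = re.sub(r"[^A-Za-z0-9\s'!?.]", " ", text)      # (ii) drop special characters (kept set is an assumption)
    text = re.sub(r"\s+", " ", text).strip()             # (iii) normalize whitespace
    return text.lower()                                  # (iv) convert to lowercase
```
          </preformat>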
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Missing Value Handling</title>
          <p>Due to field inconsistencies between the misinformation and non-misinformation training sets, we
handle missing fields with appropriate default values: for categorical features, we use False as the
default value when the field is absent; for numerical features, we use 0 as the default value.</p>
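          <p>This default-filling policy can be sketched as follows; the field names are illustrative, not the dataset's actual schema.</p>
          <preformat>
```python
# Defaults from Section 5.2.2: False for absent categorical fields, 0 for
# absent numerical fields. Field names below are hypothetical examples.
CATEGORICAL_DEFAULTS = {"user_verified": False, "is_retweet": False}
NUMERICAL_DEFAULTS = {"followers_count": 0, "retweet_count": 0}

def fill_missing(record):
    """Return a copy of `record` with absent fields set to their defaults."""
    out = dict(record)
    for field, default in {**CATEGORICAL_DEFAULTS, **NUMERICAL_DEFAULTS}.items():
        out.setdefault(field, default)
    return out
```
          </preformat>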
        </sec>
        <sec id="sec-5-2-3">
          <title>5.2.3. Resampling Strategy</title>
          <p>To address the severe class imbalance, we implement a resampling approach. The resampling is
performed once before training for both models, and the resampled dataset remains fixed throughout
all training epochs. For the first model, we upsample misinformation examples by a factor of 10 using
random sampling with replacement, while for the second model, we upsample misinformation examples
by a factor of 15 using random sampling with replacement.</p>
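          <p>The fixed, one-time upsampling step can be sketched as follows; the function name and seed handling are assumptions of this sketch.</p>
          <preformat>
```python
import random

def upsample_minority(examples, labels, minority_label=1, factor=10, seed=42):
    """Replicate minority-class examples so their count grows `factor`-fold,
    via random sampling with replacement (Section 5.2.3). Performed once
    before training; the resampled dataset stays fixed across epochs."""
    rng = random.Random(seed)
    minority = [ex for ex, y in zip(examples, labels) if y == minority_label]
    # draw (factor - 1) * n extra copies with replacement
    extra = [rng.choice(minority) for _ in range((factor - 1) * len(minority))]
    new_examples = list(examples) + extra
    new_labels = list(labels) + [minority_label] * len(extra)
    return new_examples, new_labels
```
          </preformat>
          <p>With factor = 10 (first model) each misinformation example is represented ten times on average; the second model uses factor = 15.</p>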
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Training Progress</title>
        <p>Table 4 shows the first model’s performance on the validation set across different epochs. The best
model is selected at Epoch 2, achieving an F1-score of 0.1834 with precision of 0.1141 and recall of
0.4679.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Threshold Optimization Results</title>
        <p>For the first model, through systematic grid search on the validation set, we identified the optimal
classification threshold as 0.69. Table 6 presents the model performance on the validation set using this
optimized threshold. The threshold optimization improves the F1-score from 0.1834 (at the default 0.5
threshold) to 0.2151.</p>
        <p>For the second model, we did not use the threshold optimization strategy.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>Table 8 presents the comprehensive evaluation metrics for our second model on the second test set.
Our model achieves a weighted F1-score of 0.82 on the test set with precision of 0.82 and recall of 0.84.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This paper presents a hybrid deep learning approach for misinformation detection that combines
ModernBERT with hand-crafted features derived from textual content, user metadata, and social
engagement metrics. Our methodology addresses the challenges of semantic understanding and extreme
class imbalance through the use of Focal Loss and strategic resampling techniques. The experimental
results demonstrate that our approach achieves weighted F1-scores of 0.97 and 0.82 on the two official
test datasets, respectively.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Social Science Foundation of China (24BYY080).</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Claude in order to assist with English language
refinement and improve paper structure and presentation. The authors did not use generative AI for the
core research methodology, experimental design, or result analysis. After using this tool, the authors
reviewed and edited the content as needed and take full responsibility for the publication’s content.
</p>
    </sec>
    <sec id="sec-10">
      <title>References</title>
      <p>[6] O. Ajao, D. Bhowmik, S. Zargari, Fake news identification on Twitter with hybrid CNN and
RNN models, in: Proceedings of the 9th International Conference on Social Media and Society,
SMSociety 2018, Copenhagen, Denmark, July 18-20, 2018, ACM, 2018, pp. 226-230. doi:10.1145/3217804.3217917.
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers
for language understanding, in: Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies
(NAACL-HLT 2019), Association for Computational Linguistics, 2019, pp. 4171-4186. doi:10.18653/v1/n19-1423.
[8] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692
(2019). URL: https://arxiv.org/abs/1907.11692.
[9] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention,
in: Proceedings of the 9th International Conference on Learning Representations (ICLR 2021),
OpenReview.net, 2021. URL: https://openreview.net/forum?id=XPZIaotutsD.
[10] C. Castillo, M. Mendoza, B. Poblete, Information credibility on Twitter, in: Proceedings of the
20th International Conference on World Wide Web, WWW 2011, Hyderabad, India, March 28 -
April 1, 2011, ACM, 2011, pp. 675-684. doi:10.1145/1963405.1963500.
[11] G. K. Shahi, Y. Mejova, Too little, too late: Moderation of misinformation around the
Russo-Ukrainian conflict, WebSci '25, 2025. doi:10.1145/3717867.3717876.
[12] G. K. Shahi, T. A. Majchrzak, AMUSED: An annotation framework of multimodal social media
data, in: F. Sanfilippo, O.-C. Granmo, S. Y. Yayilgan, I. S. Bajwa (Eds.), Intelligent Technologies and
Applications, Springer International Publishing, Cham, 2022, pp. 287-299.
[13] T.-Y. Lin, P. Goyal, R. B. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: IEEE
International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, IEEE
Computer Society, 2017, pp. 2999-3007. doi:10.1109/ICCV.2017.324.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zafarani</surname>
          </string-name>
          ,
          <article-title>A survey of fake news: Fundamental theories, detection methods, and opportunities</article-title>
          ,
          <source>ACM Comput. Surv.</source>
          <volume>53</volume>
          (
          <year>2021</year>
          ) 109:
          <fpage>1</fpage>
          -109:
          <lpage>40</lpage>
          . doi:10.1145/3395046.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shasirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          , G. Pasi, T. Mandl,
          <article-title>Prompt recovery for misinformation detection at FIRE 2025</article-title>
          , in: Proceedings of the 17th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE '25, Association for Computing Machinery,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shasirekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          , G. Pasi, T. Mandl,
          <article-title>Overview of the first shared task on prompt recovery for misinformation detection (PromID 2025)</article-title>
          , in: K. Ghosh, T. Mandl, S. Pal, S. Majumdar, A. Chakraborty (Eds.), Working Notes of FIRE 2025 - Forum for Information Retrieval Evaluation, Varanasi, India, December 17-20,
          <year>2025</year>
          , CEUR Workshop Proceedings, CEUR-WS.org, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Syntactic stylometry for deception detection</article-title>
          , in: The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, July 8-14,
          <year>2012</year>
          , Jeju Island, Korea - Volume 2: Short Papers, Association for Computational Linguistics, 2012, pp.
          <fpage>171</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Pérez-Rosas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lefevre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          ,
          <article-title>Automatic detection of fake news</article-title>
          , in: Proceedings of the 27th International Conference on Computational Linguistics, COLING
          <year>2018</year>
          , Santa Fe, New Mexico, USA, August 20-26, 2018, Association for Computational Linguistics, 2018, pp.
          <fpage>3391</fpage>
          -
          <lpage>3401</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>