<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Meme Classification using ModernBERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Ibrahim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Engineering Department, Cairo University</institution>
          ,
          <addr-line>1 Gamaa Street, 12613, Giza</addr-line>
          ,
          <country country="EG">Egypt</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The rapid proliferation of memes on social media necessitates robust automated systems for detecting harmful content, particularly in linguistically diverse contexts like Mexican Spanish. This paper presents a benchmark study for the DIMEMEX shared task at IberLEF 2025, leveraging ModernBERT, a transformer model enhanced with rotary positional embeddings (RoPE) and FlashAttention, to classify memes into three categories (hate speech, inappropriate content, neither) and subcategorize hate speech into classism, sexism, racism, and others. Our hierarchical multi-task learning framework achieved macro-F1 scores of 0.44 on Subtask 1 and 0.26 on Subtask 2, underscoring the challenges of fine-grained classification in low-resource, culturally nuanced settings. This work demonstrates ModernBERT's potential for long-context understanding while highlighting the need for multimodal approaches. Future research should prioritize synthetic data generation to address label scarcity, integrate vision-language architectures for joint text-image modeling, and refine ethical safeguards against cultural bias.</p>
      </abstract>
      <kwd-group>
        <kwd>Meme Classification</kwd>
        <kwd>ModernBERT</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Transformer models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The exponential growth of user-generated content on social media platforms has intensified the need
for automated systems capable of detecting inappropriate or harmful language. Memes, a prevalent
form of social media content that combines images and text, often convey complex cultural and
linguistic nuances, making the task of identifying inappropriate content particularly challenging. These
multimodal artifacts require models to integrate visual and textual semantics while decoding implicit
cultural references, a task that remains understudied in low-resource languages like Mexican Spanish.
The DIMEMEX shared task at IberLEF 2025 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] addresses this challenge by focusing on the detection
of inappropriate memes from Mexico, with subtasks that emphasize the classification of the textual
content embedded in memes. Specifically, the first two subtasks involve (1) determining whether a
meme’s text is inappropriate and (2) categorizing the type of inappropriateness, requiring models that
can understand subtle linguistic cues and cultural context in Spanish.
      </p>
      <p>
        Recent advances in Natural Language Processing (NLP) have been driven by the emergence of
transformer-based models, with BERT (Bidirectional Encoder Representations from Transformers) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
marking a major milestone. BERT’s bidirectional context modeling and pretraining on large corpora
have significantly improved performance across a wide range of text classification tasks, including
sentiment analysis, hate speech detection, and topic categorization [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, the original BERT
architecture has limitations in efficiency and in handling long contextual sequences, which are often
necessary for understanding complex and nuanced texts such as memes. For instance, BERT’s fixed
512-token window struggles with memes where sarcasm or hate speech emerges from the interplay of
text and image over extended contexts.
      </p>
      <p>
        ModernBERT, a recent evolution of BERT, incorporates several architectural and training
optimizations that address these limitations, including rotary positional embeddings (RoPE) for dynamic
positional encoding, Flash Attention 2 for accelerated computation, and extended context windows
of up to 8,192 tokens [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These improvements enable ModernBERT to process longer texts more
effectively and with greater computational efficiency, making it well-suited for tasks involving detailed
semantic understanding and fine-grained classification. Studies have demonstrated that ModernBERT
outperforms conventional BERT models in various domains, including medical text classification (e.g.,
Clinical ModernBERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) and long-context retrieval tasks, without sacrificing accuracy.
      </p>
      <p>
        The application of ModernBERT to text classification tasks, especially in low-resource or
domain-specific settings such as Spanish-language memes, benefits from transfer learning and fine-tuning
strategies that adapt the model to the target data distribution. For example, synthetic data generation
using large language models (LLMs) like GPT-4 has been shown to enhance ModernBERT’s
performance in low-resource scenarios, achieving F1 scores of 0.89 on domain classification tasks with only
1,000 synthetic examples [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Fine-tuning ModernBERT on domain-specific datasets, such as clinical
narratives or social media corpora, further enhances its ability to generalize and detect subtle forms
of inappropriate content. Moreover, hybrid approaches that combine ModernBERT embeddings with
multimodal architectures (e.g., CLIP, UNITER) or convolutional neural networks (CNNs) have been
proposed to capture both global context and local textual features, improving classification accuracy in
hateful meme detection [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>The DIMEMEX shared task provides a unique benchmark for evaluating these approaches in the
context of Mexican Spanish memes, where cultural and linguistic specificities pose additional challenges.
For instance, Mexican memes often employ regional slang, code-mixing, and historical references
that require models to internalize both language and cultural knowledge. Leveraging ModernBERT’s
capabilities, this paper explores its effectiveness in the first two subtasks of DIMEMEX, aiming to
achieve robust detection and categorization of inappropriate meme texts.</p>
      <p>The remainder of this paper is organized as follows. Section 2 reviews related work on text
classification, transformer architectures, and inappropriate content detection. Section 3 details our
methodology, including dataset preprocessing and hyperparameter configurations. Section 4 discusses
the results, and Section 5 concludes the work with future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Text classification has been a foundational task in NLP, evolving significantly from traditional machine
learning methods to modern deep learning and transformer-based approaches. The first two subtasks of
the DIMEMEX shared task at IberLEF 2025, which focus on detecting and categorizing inappropriate memes,
build upon this rich research landscape. This section elaborates on the progression of methodologies,
highlighting the role of ModernBERT and related models in state-of-the-art text classification.</p>
      <p>
        Historically, text classification relied on classical machine learning models such as Support Vector
Machines (SVM) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Naive Bayes [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Decision Trees, and Random Forests [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], often using
bag-of-words or TF-IDF features. While effective for simpler tasks, these approaches struggled with capturing
semantic context and nuances in language, particularly in noisy or informal text such as social media
posts or memes.
      </p>
      <p>
        The introduction of neural networks, especially convolutional neural networks (CNNs) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and
recurrent neural networks (RNNs) [12], marked a significant improvement by learning dense
representations and sequential dependencies in text. Hybrid models combining CNNs and bidirectional
LSTMs [13] have shown strong performance on benchmark datasets like SST-2 and AG News. However,
these architectures still faced challenges in modeling long-range dependencies and complex contextual
relationships.
      </p>
      <p>
        The transformer architecture [14] revolutionized NLP by enabling models to attend globally to
input sequences, overcoming the limitations of RNNs. BERT (Bidirectional Encoder Representations
from Transformers) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] further advanced the field by pretraining deep bidirectional representations
on massive corpora and fine-tuning on downstream tasks. BERT’s ability to capture rich contextual
information bidirectionally has made it the backbone for many text classification tasks, including
offensive language and hate speech detection.
      </p>
      <p>
        In multilingual and Spanish-specific contexts, models like mBERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and BETO [15] have been
adapted and fine-tuned, demonstrating strong performance in Iberian languages [16]. The DIMEMEX
task leverages these advances by applying ModernBERT, a refined variant of BERT with optimized
pretraining and fine-tuning strategies, to classify meme texts for inappropriateness [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        ModernBERT represents the next generation of BERT-based models, incorporating improvements
such as longer context windows, more efficient training objectives, and domain-adaptive pretraining
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For example, llm-jp-modernbert [17] extends the context length to 8192 tokens, enabling better
handling of long documents. These enhancements translate into superior performance on classification
tasks, especially in domains requiring nuanced understanding.
      </p>
      <p>Recent studies have also combined BERT with CNN classifiers to exploit local feature extraction
alongside contextual embeddings, yielding improved accuracy in sentiment analysis and social media
text classification. Hierarchical BERT models [18] have been proposed to handle multilevel classification
effectively, as demonstrated in e-commerce comment classification, where parent and subclass BERT
models are trained sequentially to enhance granularity and accuracy.</p>
      <p>Despite the dominance of transformer models, recent comparative studies have revealed that simpler
models like logistic regression or SVM with optimized n-gram features can sometimes outperform more
complex architectures, particularly when hyperparameter tuning is insufficient. This underscores the
importance of rigorous optimization during fine-tuning, including learning rate schedules, batch sizes,
and early stopping criteria [19].</p>
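      <p>As a concrete point of comparison, a tuned n-gram baseline of the kind these studies describe can be sketched with scikit-learn. This is not part of our system, and the tiny corpus below is purely illustrative:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A minimal TF-IDF + logistic regression baseline with word n-grams.
# The texts and labels below are illustrative placeholders, not DIMEMEX data.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),      # word uni- and bi-grams
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

texts = ["que buen meme", "mensaje de odio", "meme divertido", "contenido ofensivo"]
labels = ["neither", "hate", "neither", "hate"]
baseline.fit(texts, labels)
predictions = baseline.predict(texts)
```

      <p>With careful n-gram ranges and regularization tuning, such pipelines remain a strong sanity check against which transformer results should be compared.</p>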
      <p>Moreover, discriminative encoder-only models like BERT consistently outperform generative
decoder-only models (e.g., GPT) on supervised classification tasks [19]. Transfer learning and cross-validation
techniques have been instrumental in maximizing BERT’s effectiveness for multi-class classification, as
evidenced in experiments on datasets such as 20 Newsgroups and Reuters.</p>
      <p>
        The detection of inappropriate or harmful content, especially in social media and memes, has been
an active area of research. Multimodal approaches combining textual and visual features have been
explored, but text-only models based on BERT remain highly competitive [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The DIMEMEX shared
task is a notable benchmark focusing on culturally specific memes from Mexico, challenging models to
detect subtle forms of inappropriateness and categorize them accurately [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The success of ModernBERT in this context is supported by its ability to capture complex semantic cues
and contextual dependencies, which are critical for distinguishing nuanced categories of inappropriate
content. Its fine-tuning on domain-specific data, combined with hierarchical classification strategies,
aligns with best practices identified in recent literature.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset and Label Distribution</title>
        <p>
          The Detection of Inappropriate Memes from Mexico (DIMEMEX) dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] provides a benchmark
corpus composed of Mexican Spanish memes, annotated with two levels of classification. The first
level (Subtask 1) involves a three-way classification: identifying each meme as either hate speech,
inappropriate content, or neither. The second level (Subtask 2) applies only to those memes labeled as
hate speech and involves the detection of specific subcategories of hate speech.
        </p>
        <p>Subtask 2 comprises six independent binary classification tasks, each corresponding to one hate
speech subtype, including classism, racism, sexism, and other hate speech. Each meme can belong to one or
several of these categories, motivating the need for a fine-grained, multi-aspect approach. The label
distribution across these subcategories is highly imbalanced, further complicating effective training
and generalization.</p>
        <p>In Subtask 1, the label proportions are distributed as follows: 62% “neither,” 23% “inappropriate
content,” and 15% “hate speech.” In Subtask 2, among hate speech instances, the six subcategories are
sparsely populated, with classism and racism being the most prevalent. Due to this imbalance and task
segmentation, we employed targeted strategies for data balancing and architecture design.</p>
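        <p>To make the balancing strategy concrete, inverse-frequency class weights for Subtask 1 can be derived directly from the proportions above (a sketch; the mean-normalization choice is an assumption of ours, not a detail reported by the task organizers):</p>

```python
# Subtask 1 label proportions reported above.
proportions = {"neither": 0.62, "inappropriate": 0.23, "hate_speech": 0.15}

def inverse_frequency_weights(props):
    """Weight each class by 1/frequency, normalized so the weights average
    to 1.0, so rare classes contribute more to the training loss."""
    raw = {label: 1.0 / p for label, p in props.items()}
    mean = sum(raw.values()) / len(raw)
    return {label: w / mean for label, w in raw.items()}

weights = inverse_frequency_weights(proportions)
# The rarest class ("hate_speech") receives the largest weight.
```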
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Input Representation and Preprocessing</title>
        <p>Each meme instance includes a description field that contains the text intended for analysis. This
field, provided as part of the dataset, serves as the sole input to the model. No image data or
OCR-extracted captions were used.</p>
        <p>Text inputs were tokenized using the ModernBERT tokenizer, capable of handling sequences up
to 8192 tokens. This capacity was particularly useful for memes containing verbose or contextually
dense language. Preprocessing steps included lowercasing, normalization of punctuation, removal of
URLs and emojis, and stripping of non-linguistic characters. We deliberately avoided stemming and
lemmatization to preserve sociolinguistic signals such as slang, colloquial phrasing, and regional idioms,
which are crucial for detecting cultural nuance in hate speech.</p>
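        <p>The steps above can be sketched as follows. This is an approximation: the exact character classes retained are our assumption, chosen to keep accented letters and Spanish punctuation intact:</p>

```python
import re

def preprocess(text):
    """Approximate preprocessing: lowercase, normalize punctuation, remove
    URLs and emojis, and strip non-linguistic characters, while keeping
    accented letters and Spanish punctuation so slang and regional idioms
    survive unchanged (no stemming or lemmatization)."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove URLs
    text = text.replace("“", '"').replace("”", '"')       # normalize quotes
    text = text.replace("‘", "'").replace("’", "'")
    # Drop emojis and other symbols; \w keeps accented letters and digits.
    text = re.sub(r"[^\w\s.,;:!?¿¡'\"()-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```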
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model Architecture</title>
        <p>The architecture is based on ModernBERT, specifically the jorgeortizfuentes/tulio-modernbert-spanish
variant. This model incorporates two key enhancements: Rotary Positional Embeddings (RoPE) for
improved long-range dependency modeling, and FlashAttention2 for accelerated and memory-efficient
attention computation.</p>
        <p>For Subtask 1, a standard softmax classification head was added on top of the [CLS] token output to
predict one of the three mutually exclusive classes. This head was trained using categorical cross-entropy
loss.</p>
        <p>For Subtask 2, six independent binary classifiers were constructed, each corresponding to one
of the hate speech subcategories. These classifiers were trained separately on filtered subsets of the
data, where only the hate speech-labeled memes were included. Each classifier operates independently,
receiving the shared ModernBERT-encoded [CLS] token as input, and applies a sigmoid activation to
output the probability of the specific subcategory. This modular, one-vs-rest setup ensured the capacity
for nuanced pattern recognition without interference between labels.</p>
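        <p>A minimal PyTorch sketch of this head arrangement follows. The encoder is stubbed out (in our system it is the tulio-modernbert-spanish model), and the hidden size of 768 and layer names are our assumptions for illustration:</p>

```python
import torch
import torch.nn as nn

class MemeClassificationHeads(nn.Module):
    """Heads described above: a 3-way softmax head for Subtask 1 and six
    independent binary (sigmoid) heads for Subtask 2, all reading the same
    shared [CLS] embedding. The ModernBERT encoder itself is omitted."""

    def __init__(self, hidden_size=768, num_subtypes=6):
        super().__init__()
        self.subtask1 = nn.Linear(hidden_size, 3)   # hate / inappropriate / neither
        self.subtask2 = nn.ModuleList(
            nn.Linear(hidden_size, 1) for _ in range(num_subtypes)
        )

    def forward(self, cls_embedding):
        logits = self.subtask1(cls_embedding)       # trained with cross-entropy
        subtype_probs = torch.cat(
            [torch.sigmoid(head(cls_embedding)) for head in self.subtask2],
            dim=-1,
        )                                           # one-vs-rest probabilities
        return logits, subtype_probs
```

        <p>Keeping the six Subtask 2 heads in a ModuleList makes the one-vs-rest setup explicit: each head can be trained on its own filtered subset without gradients interfering across labels.</p>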
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Training Configuration</title>
        <p>Fine-tuning was carried out using the following configuration:
• Pretrained Model: jorgeortizfuentes/tulio-modernbert-spanish
• Tokenizer: ModernBERT tokenizer (max length: 8192 tokens)
• Optimizer: AdamW
• Learning Rate: 5e-5, with linear warmup over the first 10% of training steps
• Batch Size: 16
• Epochs: 5
• Loss Function: Cross-entropy for all tasks (multi-class for Subtask 1, binary for Subtask 2
classifiers)
• Hardware: NVIDIA T4 GPU with 16GB memory</p>
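        <p>The configuration above maps onto Hugging Face's TrainingArguments roughly as follows (a sketch: output_dir is an illustrative name, and the Trainer's default AdamW optimizer and linear schedule supply the warmup-then-decay behavior listed):</p>

```python
from transformers import TrainingArguments

# Sketch of the fine-tuning configuration listed above.
training_args = TrainingArguments(
    output_dir="modernbert-dimemex",   # illustrative output path
    learning_rate=5e-5,
    warmup_ratio=0.1,                  # linear warmup over first 10% of steps
    per_device_train_batch_size=16,
    num_train_epochs=5,
    fp16=True,                         # mixed-precision training (Section 3.5)
)
```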
        <p>To address class imbalance, we used inverse-frequency class weighting, combined with random
oversampling of minority classes during training. All experiments followed a stratified 5-fold
cross-validation setup, with 80/20 train-validation splits maintained within each fold. Early stopping was
disabled to allow full training cycles, and the model checkpoint with the highest validation macro-F1
score was retained.</p>
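        <p>Random oversampling of the kind used here can be sketched as follows (a hypothetical helper for illustration, not our exact training code):</p>

```python
import random

def oversample(examples, labels, seed=42):
    """Duplicate randomly chosen minority-class examples until every class
    reaches the majority-class count, as in the balancing step above."""
    rng = random.Random(seed)
    grouped = {}
    for example, label in zip(examples, labels):
        grouped.setdefault(label, []).append(example)
    target = max(len(items) for items in grouped.values())
    balanced = []
    for label, items in grouped.items():
        extras = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((example, label) for example in items + extras)
    rng.shuffle(balanced)               # avoid label-ordered batches
    return balanced
```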
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Implementation and Inference</title>
        <p>The implementation leveraged PyTorch and Hugging Face’s Transformers library. Mixed-precision
training (FP16) was enabled to reduce memory consumption and accelerate training. For inference, the
model architecture was optimized for deployment on a single NVIDIA T4 GPU, supporting real-time
classification in content moderation pipelines.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We evaluated our ModernBERT-based model across different training configurations for both subtasks
in the DIMEMEX challenge using 5-fold cross-validation. Table 1 summarizes the average macro-F1,
precision, and recall scores on the validation sets.</p>
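      <p>For reference, macro-F1, the metric reported throughout, can be computed as below. This is the standard definition, not code from our evaluation pipeline:</p>

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1, so a minority class such as hate
    speech counts as much as the majority 'neither' class."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(classes)
```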
      <p>For Subtask 1, our best configuration combined class-weighted loss and oversampling of minority
classes. This configuration yielded the highest macro-F1 score of 0.49, demonstrating notable
improvements over the baseline. Oversampling especially boosted the recall of the hate speech class, which was
severely underrepresented.</p>
      <p>Subtask 2 posed greater challenges. Despite leveraging domain-specific pretraining and targeted
fine-tuning, performance plateaued at a macro-F1 score of 0.28. This outcome reflects both the
inherent difficulty of distinguishing between hate speech subtypes and the compounding effect of error
propagation from Subtask 1.</p>
      <p>We also monitored training and validation curves across all folds. Loss convergence was stable,
though signs of overfitting began appearing around epoch 4 in the baseline setup. Early stopping was
not required in the best-performing runs, and performance gains from oversampling were consistently
observed across folds.</p>
      <p>The results affirm that while ModernBERT’s architecture supports improved contextual
understanding, its performance is bounded by data scarcity, subtle semantics, and label granularity—challenges
that motivate future multimodal and culturally informed approaches.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>This study investigated the application of the ModernBERT architecture to the Detection of Inappropriate
Memes from Mexico (DIMEMEX) shared task, which focuses on categorizing Mexican Spanish memes
into high-level categories (hate speech, inappropriate content, or neither) and further identifying six
distinct subtypes of hate speech. Leveraging rotary positional embeddings and long-context encoding,
ModernBERT was adapted to a hierarchical setup with a multi-class classification head for Subtask 1
and six independent binary classifiers for Subtask 2.</p>
      <p>Despite the architectural advantages of ModernBERT, the model faced considerable challenges
inherent to the task. On the DIMEMEX official test set, our best system achieved a macro-F1 score of
0.44 for Subtask 1 and 0.26 averaged across the six binary classifiers in Subtask 2. These modest scores
underscore the difficulty of the problem, which is amplified by extreme class imbalance, the subtleties
of cultural and linguistic expression in memes, and the absence of visual context.</p>
      <p>Key limitations were identified in both the data and the modeling approach. Most notably, the majority
of training examples in Subtask 1 were labeled "neither," limiting the model’s exposure to harmful
content. Additionally, Subtask 2 required the detection of subtle and often overlapping expressions
of hate, which are not always easily separable from informal or regional language. The reliance on
text-only inputs further constrained performance, as memes are inherently multimodal, and critical
information is frequently embedded in the accompanying image.</p>
      <p>To advance this line of research, several promising directions emerge. First, addressing label
imbalance through data augmentation techniques—such as GPT-based paraphrasing, synthetic minority
oversampling, or contrastive learning—may improve recall for underrepresented classes. Second, the
incorporation of visual information via multimodal models (e.g., CLIP, vision transformers) would
provide contextual cues often missing from text alone. Third, the pipeline could be restructured by
decoupling Subtask 1 and Subtask 2 into standalone classifiers, which may reduce cumulative error and
better capture the hierarchical nature of the problem.</p>
      <p>Future work should also explore uncertainty-aware training and active learning, which can prioritize
ambiguous or borderline cases for manual annotation. This could improve the quality of labels,
particularly for the more subjective hate speech subcategories. Lastly, ethical considerations must guide
model design and deployment. Bias mitigation strategies—such as adversarial debiasing, dialect-aware
calibration, and post-hoc fairness audits—are essential to avoid disproportionate moderation of informal,
dialectal, or culturally specific content that is not inherently harmful.</p>
      <p>Overall, while the current results reflect the difficulty of automated meme moderation in low-resource,
culturally rich contexts, they establish a robust foundation for future innovation. Progress in model
architecture, training methodologies, and ethical oversight will be key to developing systems that
balance moderation effectiveness with cultural sensitivity and fairness.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used ChatGPT and Grammarly in order to check grammar and
spelling and to paraphrase and reword. After using these tools, the author reviewed and edited
the content as needed and takes full responsibility for the publication’s content.</p>
      <p>[12] S. Lai, L. Xu, K. Liu, J. Zhao, Recurrent convolutional neural networks for text classification, in:
Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
[13] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[15] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and
evaluation data, arXiv preprint arXiv:2308.02976 (2023).
[16] J. Á. González-Barba, L. Chiruzzo, S. M. Jiménez-Zafra, Overview of IberLEF 2025: Natural
Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the
Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the
Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS.org, 2025.
[17] I. Sugiura, K. Nakayama, Y. Oda, llm-jp-modernbert: A ModernBERT model trained on a
large-scale Japanese corpus with long context length, 2025. URL: https://arxiv.org/abs/2504.15544.
arXiv:2504.15544.
[18] X. Zhang, F. Wei, M. Zhou, HIBERT: Document level pre-training of hierarchical bidirectional
transformers for document summarization, arXiv preprint arXiv:1905.06566 (2019).
[19] L. Galke, A. Diera, B. X. Lin, B. Khera, T. Meuser, T. Singhal, F. Karl, A. Scherp, Are we really making
much progress in text classification? A comparative review, arXiv preprint arXiv:2204.03954 (2022).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kapil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <article-title>A transformer based multi task learning approach to multimodal hate speech detection</article-title>
          ,
          <source>Natural Language Processing Journal</source>
          <volume>11</volume>
          (
          <year>2025</year>
          )
          <fpage>100133</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jarquín-Vásquez</surname>
          </string-name>
          , et al.,
          <article-title>Overview of DIMEMEX at IberLEF 2025: Detection of Inappropriate Memes from Mexico</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>75</volume>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers</article-title>
          ),
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Minaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kalchbrenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cambria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nikzad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chenaghlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Deep learning-based text classification: a comprehensive review, ACM computing surveys (CSUR) 54 (</article-title>
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Warner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chafin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Clavié</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Weller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hallström</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Taghadouini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gallagher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ladhak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Aarsen</surname>
          </string-name>
          , et al.,
          <article-title>Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference</article-title>
          ,
          <source>arXiv preprint arXiv:2412.13663</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <article-title>Clinical ModernBERT: An efficient and long context encoder for biomedical text</article-title>
          ,
          <source>arXiv preprint arXiv:2504.03964</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Berenstein</surname>
          </string-name>
          ,
          <article-title>Fine-tune ModernBERT for text classification using synthetic data</article-title>
          , 2024. URL: https://huggingface.co/blog/davidberenstein1957/fine-tune-modernbert-on-synthetic-data, Hugging Face Blog.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          , et al.,
          <article-title>Transductive inference for text classification using support vector machines</article-title>
          ,
          <source>in: Icml</source>
          , volume
          <volume>99</volume>
          ,
          <year>1999</year>
          , pp.
          <fpage>200</fpage>
          -
          <lpage>209</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nigam</surname>
          </string-name>
          , et al.,
          <article-title>A comparison of event models for naive bayes text classification</article-title>
          ,
          <source>in: AAAI-98 workshop on learning for text categorization</source>
          , volume
          <volume>752</volume>
          , Madison, WI,
          <year>1998</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          , Random forests,
          <source>Machine learning 45</source>
          (
          <year>2001</year>
          )
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Convolutional neural networks for sentence classification</article-title>
          ,
          <source>in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1746</fpage>
          -
          <lpage>1751</lpage>
          . https://doi.org/10.3115/v1/d14-1181.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>