<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PFLex: Perturbation-Free Local Explanations in Language Model-Based Text Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yogachandran Rahulamathavan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Misbah Farooq</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Varuna De Silva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Digital Technologies, Loughborough University</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Large Language Models (LLMs) excel at text classification but remain difficult to interpret. Traditional methods like Local Interpretable Model-Agnostic Explanations (LIME) and Shapley Additive Explanations (SHAP) rely on input perturbations and require thousands of model passes, which makes them computationally expensive and unscalable for large models. To address this, we propose a structured learning framework that estimates word importance using a Siamese neural network, eliminating the need for perturbations. Our approach generates one-shot explanations, reducing computation by four orders of magnitude for BERT. Evaluated on emotion classification and depression classification tasks, it achieves over 90% agreement with LIME. It also demonstrates strong robustness, offering a scalable alternative for explaining language model-based classification tasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. Motivation</title>
        <p>Transformers and LLMs have revolutionized NLP by using self-attention to model long-range
dependencies [2]. They process sequences in parallel, capturing contextual relationships via multi-head attention.
Word tokens, mapped to dense embeddings, represent semantic properties. Special tokens like [CLS]
aggregate sentence information. Through transformer layers, embeddings evolve, refining the model’s
decisions.
</p>
        <p>[Figure: LIME and PFLex pipelines for the input “i love explainable AI” (classification result “joy”). LIME perturbs the sentence, classifies the perturbed sentences with the LLM, and derives word importances from those results; PFLex derives word importances from the embeddings for [CLS] and the words.]</p>
        <p>Intuitively, words critical to the classification task (i.e., emotion) should be more aligned with the
[CLS] embedding. In the example sentence provided earlier, the Euclidean distance between the
[CLS] embedding and the embeddings of the tokens [susceptible] and [insecure] should be smaller
than the distance between [CLS] and the embeddings of other tokens such as [having]. To validate this
hypothesis, we randomly selected 100 sentences with different emotions. We then obtained the LIME
word importance score for every word in each sentence. For the words with higher importance
scores, we measured the Euclidean distance between the word embeddings and the [CLS] embedding at
various layers of a fine-tuned BERT model. As depicted in Figure 3, in the first layer, words identified as
important by LIME tend to have greater distances from the [CLS] token. In the final layer, however, the
same high-importance words cluster much closer to the [CLS] token, supporting our intuition.</p>
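        <p>The layer-wise distance check described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the hidden states have already been extracted into nested lists with the [CLS] vector at index 0 of each layer, and the function names and toy numbers are hypothetical.</p>

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cls_distances_per_layer(hidden_states):
    """For each layer, return the distance from [CLS] to every other token.

    hidden_states: list over layers; each layer is a list of token vectors,
    with the [CLS] vector assumed to sit at index 0.
    """
    distances = []
    for layer in hidden_states:
        cls_vec, word_vecs = layer[0], layer[1:]
        distances.append([euclidean(cls_vec, w) for w in word_vecs])
    return distances


# Toy 2-layer, 3-token example with 2-dimensional vectors (made-up numbers).
toy_states = [
    [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]],  # first layer: word 1 is far from [CLS]
    [[1.0, 1.0], [1.0, 2.0], [4.0, 5.0]],  # last layer: word 1 is close to [CLS]
]
per_layer = cls_distances_per_layer(toy_states)
```

        <p>In this toy input the first word moves from the farthest token in the first layer to the closest one in the last layer, which is the pattern the text reports for words that LIME ranks as important.</p>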
        <p>While this observation supports our intuition, directly applying Euclidean distance to the embeddings
is insufficient to extract the word importances accurately, because the relationship between word embeddings and
classification decisions is non-linear. However, we can train a neural network to capture this non-linear
relationship and extract the hidden word importance scores. The aim of the neural network is
to increase the word importance score for important words and decrease it for
other words.</p>
        <p>A Siamese neural network architecture [6] is particularly well-suited for this task. By employing
two identical subnetworks, the Siamese architecture processes both the [CLS] token embedding and
individual word embeddings in a shared representation space. The network is trained to maximize
similarity for words with high feature importance while minimizing similarity for less relevant words. To
validate the idea, we evaluated our approach on fine-tuned BERT models for emotion classification [18]
and depression classification [19] using the Twitter sentiment dataset (footnote 1) and the Reddit Depression Dataset (footnote 2).
The experimental results show more than 90% agreement with the LIME-based word importance scores
while improving efficiency by four orders of magnitude for the BERT model. It should be noted that the
savings would increase with larger models. Stress tests further validate its robustness, making it a
scalable alternative for LLM explainability.</p>
        <p>Footnote 1: Elkomy, A. (2024). Twitter Emotion Dataset: Unveiling the Emotional Tapestry of Social Media. Available at: https://www.kaggle.com/datasets/adhamelkomy/twitter-emotion-dataset/data.</p>
        <p>Footnote 2: Depression: Reddit Dataset (Cleaned) (2020), https://www.kaggle.com/datasets/infamouscoder/depression-reddit-cleaned.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Review</title>
      <sec id="sec-2-1">
        <p>We review three types of XAI approaches in this section.</p>
        <sec id="sec-2-1-1">
          <title>2.1. Feature Attribution-Based Explanations</title>
          <p>Feature attribution methods quantify the contribution of individual input words to a model’s prediction.
Prominent techniques include LIME [3] and SHAP [4], which have been widely adopted in text-based
decision systems [7]. These methods typically operate by perturbing inputs or applying game-theoretic
principles to assess feature importance. For instance, LIME [3] generates perturbed variations of the
original input by omitting or replacing words and observes how these changes affect predictions. It then
fits an interpretable surrogate model to approximate the original model’s decision boundary. SHAP [4],
on the other hand, leverages Shapley values to attribute importance scores based on cooperative game
theory, capturing both individual and interaction effects of words. Despite their effectiveness, these
methods are computationally intensive. LIME and SHAP require the generation of multiple perturbed
samples per input, leading to high inference costs, particularly for large-scale LLMs.</p>
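          <p>To make the cost of perturbation concrete, the sketch below shows a simplified perturbation-based attribution in the spirit of LIME. It is not the LIME algorithm or library: instead of fitting LIME's weighted linear surrogate, it scores each word by the difference in mean prediction between perturbed sentences that keep the word and those that drop it. The classifier <monospace>toy_joy_probability</monospace> and all other names are hypothetical stand-ins.</p>

```python
import random


def toy_joy_probability(words):
    """Hypothetical stand-in for the classifier: probability of "joy"."""
    probability = 0.1
    if "love" in words:
        probability += 0.7
    if "explainable" in words:
        probability += 0.1
    return probability


def perturbation_importance(words, predict, n_samples=500, seed=0):
    """Score each word by how much keeping it raises the mean prediction."""
    rng = random.Random(seed)
    n = len(words)
    kept_sum, kept_count = [0.0] * n, [0] * n
    dropped_sum, dropped_count = [0.0] * n, [0] * n
    for _ in range(n_samples):
        # Each perturbed sentence keeps every word with probability 0.5.
        mask = [rng.random() > 0.5 for _ in range(n)]
        # One model pass per perturbed sentence: this is the expensive step.
        probability = predict([w for w, keep in zip(words, mask) if keep])
        for i, keep in enumerate(mask):
            if keep:
                kept_sum[i] += probability
                kept_count[i] += 1
            else:
                dropped_sum[i] += probability
                dropped_count[i] += 1
    return [
        kept_sum[i] / max(kept_count[i], 1) - dropped_sum[i] / max(dropped_count[i], 1)
        for i in range(n)
    ]


sentence = ["i", "love", "explainable", "AI"]
scores = dict(zip(sentence, perturbation_importance(sentence, toy_joy_probability)))
```

          <p>Even for this four-word sentence the sketch calls the model <monospace>n_samples</monospace> times, so the explanation cost grows with both the number of perturbations and the cost of a single forward pass.</p>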
        </sec>
        <sec id="sec-2-1-2">
          <title>2.2. Example-Based Explanations</title>
          <p>This approach includes adversarial examples and counterfactual explanations. As studied in [8],
adversarial examples involve minimally altered inputs that cause misclassification, exposing model
vulnerabilities. While valuable, these methods face challenges in computational cost and example quality.
TEXTFOOLER [8], for instance, generates adversarial inputs via synonym substitution, which can be
costly. Similarly, crafting meaningful counterfactual explanations requires careful input modifications
to ensure interpretability. Recent advancements, such as Uni-Modal Event-Agnostic Knowledge
Distillation (UEKD) [9] for multimodal fake news detection and LLM Sentinel (LLAMOS) [10] for adversarial
defense, have improved the robustness of example-based explanations. Interactive XAI systems like
TalkToModel [12] also enhance user understanding by facilitating human-model interactions, though
ensuring explanation validity remains a challenge.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>2.3. Attention-Based Explanations</title>
          <p>Attention-based explanations leverage the inherent attention mechanisms within transformer models
to provide insights into their decision-making process. These methods typically utilize attention
weights to highlight influential features. For instance, [13] proposes a text classification method that
combines keyword-based approaches with attention mechanisms. Similarly, AttentionViz [14] leverages
attention patterns to reveal relationships within the model. However, recent research has highlighted
the limitations of relying solely on attention weights for faithful explanations. The study in [15] has
demonstrated that attention weights may not always accurately reflect the true importance of input
features and can even be misleading in certain cases [16]. These limitations stem from the fact that
attention weights can encode information beyond feature importance, leading to misinterpretations
[17].</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Training Data Construction</title>
        <p>To train a perturbation-free word importance model, we use LIME-generated scores as ground truth.
Sentences are passed through a fine-tuned BERT for emotion predictions. LIME perturbs words, trains a
surrogate model, and assigns importance scores. These scores are used as labels. Sentences are converted
to word-CLS embedding pairs, framing the task as word similarity estimation. With approximately
1000 sentences per class and an average length of 30 words, the dataset contains around 200,000 word
embeddings for emotion classification and 60,000 for depression classification.</p>
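        <p>The pairing step can be sketched as follows, under the assumption that each sentence yields one [CLS] vector, one vector per word, and one LIME score per word. The function names are illustrative, and the size estimate simply multiplies the figures quoted above, with the six emotion classes taken from the experimental setup.</p>

```python
def build_training_pairs(cls_embedding, word_embeddings, lime_scores):
    """Pair a sentence's [CLS] embedding with each word embedding and its LIME score."""
    if len(word_embeddings) != len(lime_scores):
        raise ValueError("one LIME score is needed per word embedding")
    return [(cls_embedding, w, s) for w, s in zip(word_embeddings, lime_scores)]


def expected_pair_count(n_classes, sentences_per_class, avg_words):
    """Rough dataset size: every word of every sentence contributes one pair."""
    return n_classes * sentences_per_class * avg_words


# Toy sentence with three words and 2-dimensional embeddings (made-up numbers).
pairs = build_training_pairs(
    cls_embedding=[0.2, 0.9],
    word_embeddings=[[0.1, 0.8], [0.7, -0.3], [0.0, 0.1]],
    lime_scores=[0.6, -0.2, 0.0],
)

# Emotion task: 6 classes, about 1000 sentences per class, about 30 words each.
emotion_pairs = expected_pair_count(6, 1000, 30)
```

        <p>The estimate gives 180,000 pairs for the emotion task, which is consistent with the figure of around 200,000 embeddings.</p>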
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Siamese Network Architecture and Training</title>
        <p>Let us denote the [CLS] token embedding as $h_{\mathrm{CLS}}$, the embedding of word $w$ as $h_{w}$, and the word importance score as $s(w)$. To model the relationship between word embeddings and their importance, we employ a Siamese
network [6]. As shown in Figure 4, the network consists of two identical subnetworks that transform
the [CLS] token embedding and word embeddings into a shared representation space. The objective is
to maximize similarity for words with high importance and minimize similarity for less relevant words.
Our network consists of two fully connected layers with ReLU activation and dropout for regularization.
The transformation function is given by:</p>
        <p>$$ e_{\mathrm{CLS}} = W(h_{\mathrm{CLS}}), \qquad e_{w} = W(h_{w}). \tag{1} $$</p>
        <p>The similarity between the transformed embeddings is computed using cosine similarity:</p>
        <p>$$ \mathrm{sim}(e_{\mathrm{CLS}}, e_{w}) = \frac{e_{\mathrm{CLS}} \cdot e_{w}}{\lVert e_{\mathrm{CLS}} \rVert \, \lVert e_{w} \rVert}. \tag{2} $$</p>
        <p>To train the network, we use GRPO. GRPO is a reinforcement learning-inspired optimization strategy
designed to stabilize learning and improve convergence in policy-based learning tasks. In our context,
GRPO is employed to fine-tune the Siamese network such that words with higher feature importance
scores align more closely with the [CLS] token in the learned representation space. The loss function
used to train the model is defined as follows:</p>
        <p>$$ \mathcal{L} = -\sum_{i=1}^{G} s(w_i) \cdot \mathrm{sim}(e_{\mathrm{CLS}}, e_{w_i}) + \lambda \sum_{i=1}^{G} \mathrm{sim}(e_{\mathrm{CLS}}, e_{w_i})^{2}, \tag{3} $$</p>
        <p>where $\lambda$ is a granularity factor, set to 1 in this context, $G$ is the number of words in the group selected for a given epoch, and $w_i$ is the $i$-th word of that group. The first term in the loss function attempts to increase $s(w_i) \cdot \mathrm{sim}(e_{\mathrm{CLS}}, e_{w_i})$.
Therefore, if the word is important for the classification (i.e., $s(w_i)$ is positive), the network optimizes
the weights such that $\mathrm{sim}(e_{\mathrm{CLS}}, e_{w_i})$ is positive. On the other hand, if the word is not important for
the classification (i.e., $s(w_i)$ is negative), the weights are optimized such that
$\mathrm{sim}(e_{\mathrm{CLS}}, e_{w_i})$ is negative. Cosine similarity was employed because both positive and negative similarity scores are required.
However, to mitigate the tendency of similarity scores to converge towards
the extreme values of +1 or −1 during loss minimization, a regularization term, $\lambda \sum_{i=1}^{G} \mathrm{sim}(e_{\mathrm{CLS}}, e_{w_i})^{2}$,
was incorporated into the loss function (Equation 3). This term discourages the attainment of
maximum or minimum similarity values, thereby stabilizing the training process.</p>
        <p>[Figure: panels labelled SAD, LOVE, and JOY]</p>
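        <p>As a minimal sketch of the loss in Equation 3, the code below computes the negated importance-weighted cosine similarity over a group of words plus the squared-similarity penalty. It operates on already-transformed vectors and is not the authors' training code; the names <monospace>pflex_loss</monospace> and <monospace>lam</monospace> and the toy vectors are assumptions made for illustration.</p>

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pflex_loss(e_cls, word_vectors, scores, lam=1.0):
    """Negated score-weighted similarity plus a squared-similarity penalty."""
    sims = [cosine_similarity(e_cls, e_w) for e_w in word_vectors]
    alignment = sum(s * sim for s, sim in zip(scores, sims))
    penalty = lam * sum(sim ** 2 for sim in sims)
    return -alignment + penalty


# Toy group: one aligned word, one opposed word, one orthogonal word.
e_cls = [1.0, 0.0]
word_vectors = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]

# Scores that agree in sign with the similarities give a lower loss ...
loss_agree = pflex_loss(e_cls, word_vectors, scores=[0.8, -0.5, 0.0])
# ... than scores whose signs disagree with the similarities.
loss_disagree = pflex_loss(e_cls, word_vectors, scores=[-0.8, 0.5, 0.0])
```

        <p>The penalty contributes the same amount in both cases, so the gap between the two losses comes entirely from the first term, which rewards similarities whose sign matches the importance score.</p>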
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>
        We validated our method using fine-tuned BERT models [18, 19] (110M parameters, 12 layers; 93.4% accuracy on the
emotion task and 98.8% on the depression task). The datasets are the Twitter emotion dataset (6 classes, 1000/200 train/test sentences per
class, ∼30 words per Tweet) and the Reddit depression dataset (2 classes, 800/200 train/test sentences per class). We created CLS-word
embedding pairs labelled with LIME-derived importance scores in the range −1 to 1 (∼200,000 embeddings for emotion and
∼50,000 for depression). The Siamese network (two subnetworks with 784 → 512 → 128 →
64 layers, ReLU activation, and 20% dropout) transformed the embeddings into a latent space. Cosine similarity (Equation 2) was
computed and the network was optimized with the loss in Equation (3) using Adam (learning rate $1 \times 10^{-4}$) for 300 epochs. Feature
extraction and training (6000 sentences) each took ∼1 hour on a system with 16 GB RAM and an RTX 2080 GPU.
      </p>
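      <p>To give a sense of how small the explanation network is relative to the classifier, the sketch below counts the trainable parameters of one subnetwork from the reported layer widths. It assumes plain dense layers with bias terms and fully shared weights between the two branches, neither of which is stated explicitly in the text. The widths are used as reported; BERT-base hidden states are 768-dimensional, so the first entry can be changed if the input width differs.</p>

```python
def dense_parameter_count(layer_sizes, use_bias=True):
    """Number of weights (and optional biases) in a stack of dense layers."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out
        if use_bias:
            total += fan_out
    return total


# Layer widths as reported in the experimental setup.
reported_sizes = [784, 512, 128, 64]

# With shared weights, both branches reuse this single set of parameters.
shared_params = dense_parameter_count(reported_sizes)

# Fraction of the size of a 110M-parameter BERT classifier.
fraction_of_bert = shared_params / 110_000_000
```

      <p>Under these assumptions the subnetwork has 475,840 parameters, under half a percent of the size of the BERT model it explains.</p>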
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results</title>
      <p>To qualitatively assess our method, we generate heatmaps illustrating word importance as determined
by PFLex for selected test sentences (Figure 5). The visual representation reveals a clear correspondence
with PFLex’s importance scores. To evaluate the approach quantitatively, we compare
PFLex against LIME in the stress test below.</p>
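      <p>The stress test follows a simple recipe: rank the words of each sentence by importance, remove the top-ranked ones, and re-measure accuracy. The sketch below illustrates that recipe with a hypothetical keyword classifier and a hypothetical explainer standing in for the fine-tuned model and for LIME or PFLex; none of the names come from the original work.</p>

```python
def remove_top_k(words, importances, k):
    """Drop the k words with the highest importance scores, keeping word order."""
    ranked = sorted(range(len(words)), key=lambda i: importances[i], reverse=True)
    dropped = set(ranked[:k])
    return [w for i, w in enumerate(words) if i not in dropped]


def stress_test_accuracy(dataset, predict, explain, k):
    """Accuracy after removing the k most important words from every sentence."""
    correct = sum(
        1
        for words, label in dataset
        if predict(remove_top_k(words, explain(words), k)) == label
    )
    return correct / len(dataset)


def toy_predict(words):
    """Hypothetical keyword classifier standing in for the fine-tuned model."""
    if "love" in words:
        return "joy"
    if "sad" in words:
        return "sadness"
    return "neutral"


def toy_explain(words):
    """Hypothetical explainer: full importance for the emotion keywords."""
    return [1.0 if w in {"love", "sad"} else 0.0 for w in words]


toy_dataset = [
    (["i", "love", "this"], "joy"),
    (["so", "sad", "today"], "sadness"),
]
accuracy_full = stress_test_accuracy(toy_dataset, toy_predict, toy_explain, k=0)
accuracy_top1_removed = stress_test_accuracy(toy_dataset, toy_predict, toy_explain, k=1)
```

      <p>A faithful explainer produces a large gap between the two accuracies, because the words it ranks highest are the ones the classifier actually depends on.</p>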
      <sec id="sec-5-1">
        <title>5.1. Stress test</title>
        <p>A stress test evaluates the faithfulness of feature importance scores by perturbing the input and observing the impact
on model predictions. In text classification, this involves removing the words ranked as most important (e.g., by LIME
or PFLex) and measuring the drop in accuracy. Significant degradation upon removal indicates genuine
feature importance. We performed the stress test to measure the faithfulness of the proposed PFLex
approach. As shown in Figure 6, removing the most important features leads to a sharp decline in
overall accuracy for both tasks. With all features present, the original model achieves high accuracies
of 93% and 98% for the emotion and depression tasks, respectively. However, for the emotion task, when the single
most important word is removed, accuracy drops drastically to 19.77% using LIME and 34.20% using
PFLex. This pattern continues with additional removals, reinforcing the critical role these top-ranked
words play in determining model predictions. A similar pattern was observed for the depression task,
which validates the effectiveness of PFLex as a perturbation-free alternative to LIME.</p>
      </sec>
      <sec>
        <title>5.2. Evaluating Alignment Between LIME and PFLex Feature Importance Scores</title>
        <p>To assess the agreement between LIME and PFLex feature importance rankings, we computed cosine
similarity scores between their word-level importance values. The analysis was conducted under
varying levels of feature selection, progressively filtering out less significant words to focus on the
most impactful ones. The histograms in Figure 7 illustrate how this similarity evolves across different
filtering thresholds. When all features are considered, the cosine similarity between LIME and PFLex
exhibits a wider spread (showing only 60% correlation), indicating moderate alignment. However,
when we consider only the top 10% of words with the highest absolute importance scores, the correlation
between the two methods reaches more than 90%. This supports the hypothesis that PFLex effectively
identifies the most crucial features in a manner that closely aligns with LIME, particularly for the most
influential words in a sentence.</p>
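        <p>The filtered comparison can be sketched as below. Two details are assumptions rather than facts from the text: the words are selected by the absolute value of the LIME scores, and the comparison is made per sentence. The score vectors are made-up numbers chosen so that the two methods agree on the dominant words and disagree on the weak ones.</p>

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_fraction_indices(scores, fraction):
    """Indices of the top `fraction` of words by absolute score (at least one)."""
    k = max(1, int(round(fraction * len(scores))))
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return sorted(ranked[:k])


def alignment(lime_scores, other_scores, fraction=1.0):
    """Cosine similarity restricted to the words LIME ranks highest."""
    keep = top_fraction_indices(lime_scores, fraction)
    return cosine_similarity(
        [lime_scores[i] for i in keep], [other_scores[i] for i in keep]
    )


# Made-up scores for a ten-word sentence.
lime = [0.9, -0.6, 0.05, -0.02, 0.01, 0.03, -0.04, 0.02, 0.0, 0.01]
pflex = [0.8, -0.5, -0.3, 0.3, -0.2, -0.2, 0.3, -0.2, 0.1, -0.1]

alignment_all = alignment(lime, pflex, fraction=1.0)
alignment_top = alignment(lime, pflex, fraction=0.2)
```

        <p>When the filter leaves a single word, the cosine similarity reduces to sign agreement (+1 or −1), so short sentences need either a larger fraction or pooling across sentences for the measure to be informative.</p>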
      </sec>
      <sec id="sec-5-2">
        <title>5.3. Complexity Comparison</title>
        <p>Figure 8 compares LIME and PFLex in terms of execution time and computational
cost. In terms of execution time, LIME exhibits a substantial processing delay due to its
perturbation-based approach. For small sentences (10 words), LIME takes approximately 5 seconds, whereas PFLex
completes the explanation in just 0.52 seconds, achieving nearly a 10-fold speedup. This disparity
becomes even more pronounced as sentence length increases. For medium-length sentences (20 words),
LIME requires 7.13 seconds, while PFLex remains highly efficient at 0.54 seconds. The most striking
difference occurs for long sentences (30 words), where LIME takes 40.77 seconds,
whereas PFLex maintains a stable processing time of just 0.55 seconds.</p>
        <p>The computational cost analysis, shown in the second bar chart, further emphasizes the advantage of
PFLex over LIME. LIME requires a substantial number of FLOPs due to the repeated inference steps
needed to evaluate the perturbed samples. For small sentences, LIME requires 261 FLOPs, while PFLex
completes the task with just 1.02 FLOPs, representing a reduction of over 99% in computational
cost. This efficiency gain is even more pronounced for long sentences, where LIME demands
11,147 FLOPs, compared to only 2.72 FLOPs for PFLex. Overall, the results demonstrate that PFLex
offers a significantly more scalable and computationally efficient solution than LIME.</p>
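        <p>The ratios implied by the reported measurements can be checked with a few lines of arithmetic. The sketch below only restates the numbers quoted in this section, with FLOP counts in the units used in the text; the dictionary and function names are illustrative.</p>

```python
# (LIME, PFLex) execution time in seconds, keyed by sentence length in words.
reported_seconds = {10: (5.0, 0.52), 20: (7.13, 0.54), 30: (40.77, 0.55)}

# (LIME, PFLex) FLOP counts as quoted in the text, keyed by sentence length.
reported_flops = {10: (261.0, 1.02), 30: (11147.0, 2.72)}


def speedup(lime_value, pflex_value):
    """How many times smaller the PFLex value is than the LIME value."""
    return lime_value / pflex_value


def relative_reduction(lime_value, pflex_value):
    """Fraction of the LIME value saved by PFLex."""
    return 1.0 - pflex_value / lime_value


time_speedups = {n: speedup(*pair) for n, pair in reported_seconds.items()}
flop_reductions = {n: relative_reduction(*pair) for n, pair in reported_flops.items()}

for n_words in sorted(time_speedups):
    print(f"{n_words} words: {time_speedups[n_words]:.1f}x faster")
```

        <p>The time speedup grows from roughly 9.6x at 10 words to about 74x at 30 words, and the FLOP reduction exceeds 99% at both reported lengths, in line with the figures in the text.</p>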
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Works</title>
      <p>We introduced PFLex, a perturbation-free method for word-level feature importance in LLMs, using a
Siamese network to directly map embeddings to importance scores. PFLex achieves LIME-comparable
feature attribution with orders-of-magnitude lower computational cost. Quantitative, qualitative, and
stress tests validate PFLex’s effectiveness, showing high agreement with LIME and robustness. Analysis
of [CLS] embeddings supports our approach’s theoretical basis.</p>
      <sec id="sec-6-1">
        <title>6.1. Future Works</title>
        <p>Despite its strong performance, there remain areas for further improvement. One limitation is that PFLex
relies on precomputed feature importance scores from LIME during training, which may introduce biases
from perturbation-based methods. Future research will explore alternative self-supervised objectives to
learn feature importance directly from the model’s internal representations without requiring external
supervision.</p>
        <p>By bridging the gap between computational efficiency and interpretability, PFLex presents a promising
direction for scalable, real-time explainability in LLMs. Future developments in this space could lead to
even more lightweight and adaptable XAI techniques, ensuring that explainability remains accessible
and practical for modern NLP applications.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <sec id="sec-7-1">
        <p>The author has not employed any Generative AI tools.</p>
        <p>[3] Ribeiro, M.T., Singh, S. and Guestrin, C., 2016, August. “Why should I trust you?” Explaining the predictions of any classifier. In Proc. the 22nd ACM SIGKDD Int’l Conf. knowledge discovery and data mining (pp. 1135-1144).</p>
        <p>[4] Lundberg, S., 2017. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874.</p>
        <p>[5] Devlin, J., 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.</p>
        <p>[6] Bromley, J., Guyon, I., LeCun, Y., Säckinger, E. and Shah, R., 1993. Signature verification using a “siamese” time delay neural network. Advances in neural information processing systems, 6.</p>
        <p>[7] Li, J., Zhang, Y., Karas, Z., McMillan, C., Leach, K., and Huang, Y. (2024, April). Do Machines and Humans Focus on Similar Code? Exploring Explainability of Large Language Models in Code Summarization. In Proc. the 32nd IEEE/ACM Int’l Conf. on Program Comprehension (pp. 47-51).</p>
        <p>[8] Jin, D., Jin, Z., Zhou, J. T., and Szolovits, P. (2020, April). Is BERT really robust? a strong baseline for natural language attack on text classification and entailment. In Proc. the AAAI conference on artificial intelligence (Vol. 34, No. 05, pp. 8018-8025).</p>
        <p>[9] Liu, G., Zhang, J., Liu, Q., Wu, J., Wu, S., and Wang, L. (2024). Uni-Modal Event-Agnostic Knowledge Distillation for Multimodal Fake News Detection. IEEE Trans. Knowledge and Data Engineering.</p>
        <p>[10] Lin, G., and Zhao, Q. (2024). Large Language Model Sentinel: Advancing Adversarial Robustness by LLM Agent. arXiv preprint arXiv:2405.20770.</p>
        <p>[11] Goldshmidt, R. and Horovicz, M., 2024. TokenSHAP: Interpreting Large Language Models with Monte Carlo Shapley Value Estimation. arXiv preprint arXiv:2407.10114.</p>
        <p>[12] Slack, D., Krishna, S., Lakkaraju, H., and Singh, S. (2023). Explaining machine learning models with interactive natural language conversations using TalkToModel. Nature Machine Intelligence, 5(8), 873-883.</p>
        <p>[13] Du, C., and Huang, L. (2018). Text classification research with attention-based recurrent neural networks. International Journal of Computers Communications and Control, 13(1), 50-61.</p>
        <p>[14] Yeh, C., Chen, Y., Wu, A., Chen, C., Viégas, F., and Wattenberg, M. (2023). Attentionviz: A global view of transformer attention. IEEE Trans. Visualization and Computer Graphics.</p>
        <p>[15] Shu, K., Cui, L., Wang, S., Lee, D., and Liu, H. (2019, July). defend: Explainable fake news detection. In Proc. the 25th ACM SIGKDD Int’l Conf. on knowledge discovery and data mining (pp. 395-405).</p>
        <p>[16] Arous et al. (2021, May). Marta: Leveraging human rationales for explainable text classification. In Proc. the AAAI conference on artificial intelligence (Vol. 35, No. 7, pp. 5868-5876).</p>
        <p>[17] Chrysostomou, G. and Aletras, N. 2021. Improving the faithfulness of attention-based explanations with task-specific information for text classification. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int’l Joint Conf. Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 477–488.</p>
        <p>[18] Savani, B. (2024). Emotion Classifier. Available at: https://huggingface.co/bhadresh-savani/bert-base-uncased-emotion.</p>
        <p>[19] Jy46604790. (2024). Fake News Detect. Available at: https://huggingface.co/jy46604790/Fake-News-Bert-Detect.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Zhao</surname>
          </string-name>
          et al.
          <year>2024</year>
          .
          <article-title>Explainability for large language models: A survey</article-title>
          .
          <source>ACM Trans. Intelligent Systems and Technology</source>
          ,
          <volume>15</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>