<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>EvalxNLP: A Framework for Benchmarking Post-Hoc Explainability Methods on NLP Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mahdi Dhaini</string-name>
          <email>mahdi.dhaini@tum.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kafaite Zahra Hussain</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Efstratios Zaradoukas</string-name>
          <email>efstratios.zaradoukas@tum.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gjergji Kasneci</string-name>
          <email>gjergji.kasneci@tum.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technical University of Munich, School of Computation, Information and Technology, Department of Computer Science</institution>
          ,
          <addr-line>Boltzmannstr. 3, Garching, 85748</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>As Natural Language Processing (NLP) models continue to evolve and become integral to high-stakes applications, ensuring their interpretability remains a critical challenge. Given the growing variety of explainability methods and diverse stakeholder requirements, frameworks that help stakeholders select appropriate explanations tailored to their specific use cases are increasingly important. To address this need, we introduce EvalxNLP, a Python framework for benchmarking state-of-the-art feature attribution methods for transformer-based NLP models. EvalxNLP integrates eight widely recognized explainability techniques from the Explainable AI (XAI) literature, enabling users to generate and evaluate explanations based on key properties such as faithfulness, plausibility, and complexity. Our framework also provides interactive, LLM-based textual explanations, facilitating user understanding of the generated explanations and evaluation outcomes. Human evaluation results indicate high user satisfaction with EvalxNLP, suggesting it is a promising framework for benchmarking explanation methods across diverse user groups. By offering a user-friendly and extensible platform, EvalxNLP aims to democratize explainability tools and support the systematic comparison and advancement of XAI techniques in NLP.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Explainable AI</kwd>
        <kwd>Feature Attribution</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Although significant progress has been made in the field of NLP, transformer-based models often
operate as black boxes, making it challenging to interpret their decision-making processes. In
high-stakes domains like medical diagnosis and financial decision-making, transparency is essential for trust
and accountability. Despite the rapid development of explainability methods, there remains a lack of
standardized evaluation frameworks, particularly for NLP, where text data is inherently unstructured
and context-dependent. Existing explainability assessments vary widely, spanning qualitative user
studies and quantitative metrics like faithfulness and plausibility, yet no universal consensus exists on
the most effective approach.</p>
      <p>To address these challenges, we introduce EvalxNLP, a benchmarking framework for evaluating
post-hoc explainability methods in text classification tasks. EvalxNLP supports multiple explanation
techniques and assesses them across key properties such as faithfulness, plausibility, and complexity.
In addition, EvalxNLP integrates LLM-based natural language explanations to facilitate users’
understanding of the generated explanations and evaluations. We also conduct a user-based study to
evaluate usability and user satisfaction with the framework. Our framework provides a systematic,
user-friendly platform that aims to democratize access to explainability tools, enabling both researchers
and practitioners to compare and refine XAI techniques for NLP applications. By offering a unified and
reproducible evaluation methodology, EvalxNLP advances the field of explainability, promoting more
transparent and trustworthy AI systems.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Most existing explainability evaluation frameworks are designed for general-purpose applications, covering explainability methods used on image or tabular data. As a result, most of them lack dedicated support for text-based models. OpenXAI [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], BEExAI [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and Quantus [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] are prime examples: they support multiple data modalities but do not include text-specific evaluation metrics. Performance assessment is typically based on generic criteria, without tailored adaptations for NLP tasks. Frameworks such as Inseq [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] support sequence generation models but lack built-in evaluation metrics. XAI-Bench [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] evaluates explainability methods using synthetic data, which may not fully capture real-world text applications [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. While these frameworks provide partial solutions, they do not offer a comprehensive suite for text explainability. ferret [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] facilitates explainability evaluation for NLP models but supports only five feature-attribution methods and six evaluation metrics. Among existing XAI libraries, it is the only one with text-specific explainability features adequate to be considered an NLP-specialized explainability framework; however, it relies on metrics that have been shown to be inaccurate, especially for measuring faithfulness (as explained in Section 3.3.1). Captum [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] provides implementations for 22 attribution methods but is limited to two evaluation metrics (Infidelity and Sensitivity) and lacks built-in benchmarking. Similarly, AIX360 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] supports text data but evaluates only two properties, faithfulness and monotonicity, without systematic benchmarking capabilities. M4 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] evaluates faithfulness for image and text modalities but does not assess plausibility or complexity. While these frameworks address various aspects of explainability, among existing XAI libraries only ferret offers a complete set of key features [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]: multiple explainability methods, Transformers-readiness (built with close integration into the Hugging Face (HF) transformers library), evaluation APIs, explainable datasets (i.e., those with human-annotated rationales), and built-in visualization. In addition to these features, our tool extends functionality by incorporating recent explainability methods and recent metrics for evaluating explanation properties, and by providing an LLM-based module that generates natural language explanations to enhance user understanding. It consolidates capabilities scattered across multiple frameworks, offering a robust suite for benchmarking, evaluation, and qualitative explanations. By seamlessly integrating these features into one comprehensive framework, EvalxNLP supports practitioners in benchmarking post-hoc feature attribution (Ph-FA) explanation methods.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. The Framework</title>
      <p>The EvalxNLP framework builds on top of four main components, described below. We release the
code, tutorials, and documentation in the following repository1; for the technical implementation
details, we refer the reader to the documentation provided there.</p>
      <sec id="sec-3-1">
        <title>3.1. Explainers</title>
        <p>
          One goal of EvalxNLP is to enable users to generate diverse explanations through multiple explainability methods. The explainer component integrates eight widely recognized explainability methods from the XAI literature, specifically focusing on post-hoc feature attribution (Ph-FA) methods. EvalxNLP incorporates two categories of post-hoc methods: gradient-based and perturbation-based approaches. Gradient-based methods compute feature importance by leveraging gradients of the model’s output with respect to its input features. They efficiently utilize backpropagation, making them well-suited for deep learning models. EvalxNLP integrates five key methods: Saliency (also called Gradients) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which calculates raw gradients to highlight important inputs; Gradient×Input [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], which scales gradients by input values for enhanced clarity; Integrated Gradients [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], which averages gradients along a path from a baseline to the input; DeepLift [14], which attributes differences in activations to individual inputs for more stable attributions; and Guided BackProp [15], which filters negative gradients to highlight only positive contributions. EvalxNLP implements these methods using Captum and ferret, providing a diverse set of efficient and interpretable attribution techniques. Perturbation-based methods integrated into the toolbox include the widely used LIME [16] and SHAP [17] methods, as well as the recently introduced SHAP with interactions method (SHAP-I) [18], which augments traditional Shapley values by incorporating feature interactions [19], a notable contribution over existing frameworks. The rationale behind providing a comprehensive range of Ph-FA methods is twofold: (1) to offer users access to a diverse set of explanations from established as well as novel methods, enabling a holistic assessment and selection of explanations tailored to specific use cases; and (2) to facilitate benchmarking, comparative analyses, and evaluations of these methods based on selected evaluation criteria.
        </p>
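        <p>As a concrete illustration of the gradient-based family, the sketch below computes Gradient×Input for a toy linear scorer. The model, weights, and feature values are made-up assumptions for illustration, not part of EvalxNLP's API: for a linear model f(x) = Σ wᵢxᵢ, the gradient with respect to feature i is wᵢ, so the attribution reduces to wᵢ·xᵢ.</p>

```python
# Toy illustration of Gradient x Input (hypothetical model, not the EvalxNLP API).
# For a linear scorer f(x) = sum_i w_i * x_i, the gradient w.r.t. feature i
# is w_i, so the Gradient x Input attribution is simply w_i * x_i.

def gradient_x_input(weights, features):
    """Attribution of each feature: gradient (w_i) times input value (x_i)."""
    return [w * x for w, x in zip(weights, features)]

# Hypothetical per-token feature values and learned weights for 4 tokens.
weights = [0.9, -0.2, 0.05, 1.3]   # d f / d x_i for the linear model
features = [1.0, 1.0, 1.0, 0.5]    # input feature values

attributions = gradient_x_input(weights, features)
# The most positive attribution marks the token pushing the prediction up.
top_token = max(range(len(attributions)), key=lambda i: attributions[i])
```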
        <p>
          We implement the explanation methods by building on top of their original implementations (for
LIME and SHAP) and existing open-source libraries for the remaining methods. Specifically, we use the
original implementation for LIME. For SHAP, we utilize Partition SHAP, a variant that optimizes Shapley
value computation by exploiting feature independence. For SHAP-I, we extend the implementation
provided by the shapiq package [20], while gradient-based methods are implemented using the Captum
library [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Our approach of building upon established open-source libraries aims to facilitate and
support the expansion and development of open-source XAI libraries.
        </p>
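        <p>The perturbation-based methods above approximate Shapley values for real models; for a tiny number of features they can be computed exactly by enumerating coalitions. The sketch below does this for a made-up three-token value function (an illustrative assumption, not EvalxNLP or shapiq code), which helps make the quantity being approximated concrete.</p>

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n):
    """Exact Shapley values for an n-player coalition game (small n only)."""
    players = list(range(n))
    phi = [0.0] * n
    for i in players:
        others = [p for p in players if p != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                # Classic Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += weight * (value_fn(set(S) | {i}) - value_fn(set(S)))
    return phi

# Hypothetical "model" over 3 tokens: the prediction rises when tokens 0 and 2
# are present; the game is additive, so Shapley values recover the increments.
def value_fn(subset):
    return 1.0 * (0 in subset) + 0.5 * (2 in subset)

phi = shapley_values(value_fn, 3)
```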
      </sec>
      <sec id="sec-3-2">
        <title>3.2. LLM explanations</title>
        <p>Feature attribution explanations, especially those involving many features, can be difficult for lay users
to interpret [21]. To address this, our tool integrates an LLM-based module that automatically generates
natural language explanations to help users interpret both (1) the importance scores produced by the
various explanation methods and (2) the evaluation metric scores. These textual explanations enhance the
comprehensibility of Ph-FA outputs, particularly for non-experts, and support more informed
decision-making. Additionally, combining visual heatmaps with textual explanations offers a more accessible
and holistic view of the model’s decision-making process. To avoid unfaithful LLM-generated explanations
of model decisions [21], we do not ask the LLM to explain the model’s decisions itself; we use it solely
to verbalize the outputs of explanation methods, such as importance scores, into natural language to
make them more comprehensible. Figure 1 presents an example of an explanation heatmap (Figure
1a) and an LLM-generated explanation of the SHAP scores (Figure 1b) for an instance in the
MovieReviews dataset [22] misclassified by XLM-RoBERTa-base [23]. As shown in Figure 1b,
the LLM provides a comprehensible textual explanation that eases understanding of the SHAP scores.</p>
        <p>The LLM is integrated into the framework via an API, enabling seamless generation of textual
explanations on demand. For LLM API support, our demo uses the Together AI2 API, with
Llama3.3-70B-Instruct-Turbo as the default model for generating explanations. Users can switch models or
providers and modify the LLM instructions as needed.</p>
        <p>[Figure 1: (a) Explanation heatmap; (b) LLM-generated explanation]</p>
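        <p>The verbalization step can be sketched as follows, assuming a hypothetical prompt template and function names (the actual instructions shipped with EvalxNLP may differ): the LLM receives the raw importance scores and is asked only to restate them in plain language, never to produce its own explanation of the model's decision.</p>

```python
# Sketch of verbalizing importance scores via an LLM prompt.
# The template and variable names are illustrative assumptions, not the
# exact instructions used by EvalxNLP; the API call itself is omitted.

def build_verbalization_prompt(tokens, scores, prediction):
    pairs = ", ".join(f"'{t}': {s:+.2f}" for t, s in zip(tokens, scores))
    return (
        "You are given SHAP importance scores for a sentiment classifier.\n"
        f"Predicted label: {prediction}.\n"
        f"Token scores: {pairs}.\n"
        "Describe in plain language which tokens drove the prediction. "
        "Do not add your own explanation of the model's decision."
    )

prompt = build_verbalization_prompt(
    ["the", "movie", "was", "dreadful"],
    [0.01, 0.05, 0.02, -0.81],
    "negative",
)
```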
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Metrics</title>
        <p>To evaluate explainability methods within our NLP framework, we use a comprehensive set of metrics
covering three key properties: faithfulness, plausibility, and complexity. These ensure that explanations
align with model reasoning while remaining interpretable and concise for users. By integrating diverse
metrics, the framework supports a rigorous, holistic assessment that balances model fidelity, human
interpretability, and explanation brevity. For mathematical details, we refer readers to the original
papers introducing these metrics. (↓)/(↑) indicates that lower/higher values are better for a given metric.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Faithfulness</title>
          <p>Faithfulness measures how well the generated explanations reflect the true behavior of the model.
Compared to other frameworks that use sufficiency (suff) and comprehensiveness (comp) based on
complete token removal, an approach shown to produce inaccurate faithfulness measurements [24], we
employ soft suff and soft comp, which have proven more accurate in measuring faithfulness [25]. Soft
suff [26] ↓: Evaluates how well the most important tokens retain the model’s prediction when other
tokens are softly perturbed. It assumes that retaining more of the important tokens should preserve
the model’s output, while dropping less important tokens should have minimal impact. Soft comp [26]
↑: Measures how much the model’s prediction changes when important tokens are softly perturbed
using a Bernoulli mask. It assumes that heavily perturbing important tokens should significantly affect the
model’s output, indicating their importance to the prediction. Feature Attribution Dropping (FAD)
Curve and Normalized Area Under Curve (N-AUC) [27] ↓: Measures the impact of dropping the
most salient tokens on model performance, with the steepness of the FAD curve indicating the method’s
faithfulness. The N-AUC quantifies this steepness, where a lower score reflects better alignment
of the attribution method with the model’s true feature importance. Area Under the
Threshold-Performance Curve (AUC-TP) [28] ↓: Evaluates the faithfulness of saliency explanations
by progressively masking the most important tokens (based on their saliency scores) and measuring
the drop in the model’s performance. The AUC-TP value provides a single metric summarizing how
significantly the model relies on the highlighted tokens, with lower values indicating more faithful
explanations.</p>
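          <p>The mask-and-measure logic behind AUC-TP can be sketched in a few lines. The toy scorer, mask token, and inputs below are illustrative assumptions, not the framework's implementation: salient tokens are masked in order of importance, the remaining performance curve is integrated, and a lower area means the explanation was faithful (performance collapsed early).</p>

```python
# Minimal sketch of AUC-TP-style faithfulness (illustrative, not the exact
# EvalxNLP implementation): progressively mask the most salient tokens and
# integrate the model's remaining performance; lower area = more faithful.

def toy_score(tokens):
    # Hypothetical model: score is the fraction of positive cue words kept.
    cues = {"great", "wonderful"}
    return sum(t in cues for t in tokens) / 2.0

def auc_tp(tokens, saliency):
    order = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    kept, scores = list(tokens), []
    for i in order:
        kept[i] = "[MASK]"          # mask the next most salient token
        scores.append(toy_score(kept))
    # Trapezoidal area under the performance-vs-masking curve.
    return sum((a + b) / 2 for a, b in zip(scores, scores[1:])) / (len(scores) - 1)

tokens = ["a", "great", "and", "wonderful", "film"]
saliency = [0.0, 0.9, 0.1, 0.8, 0.05]
area = auc_tp(tokens, saliency)  # faithful saliency -> performance collapses early
```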
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Plausibility</title>
          <p>Plausibility measures how well the generated explanations align with human intuition. The following
metrics are incorporated: IOU-F1 Score [29] ↑: Computes the Intersection over Union (IoU) between
predicted and ground-truth rationales, counting a (partial) match when the overlap exceeds 50%;
these matches are then used to calculate the F1 score. Token-Level F1 Score [29] ↑: Measures alignment by
calculating the F1 score between predicted and human-annotated rationales at the token level. Area
Under the Precision-Recall Curve (AUPRC) [29] ↑: Evaluates plausibility by comparing the saliency
scores of tokens with ground-truth rationale masks, computing the area under the precision-recall
curve.</p>
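          <p>Token-level F1 can be computed directly from binary rationale masks, as in this short sketch (the masks are made-up examples, not dataset annotations):</p>

```python
def token_f1(predicted, gold):
    """Token-level F1 between predicted and human rationale masks (0/1)."""
    tp = sum(p == g == 1 for p, g in zip(predicted, gold))  # true positives
    fp = sum(p == 1 and g == 0 for p, g in zip(predicted, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

pred = [1, 1, 0, 0, 1]   # tokens the explainer marks as important
gold = [1, 0, 0, 0, 1]   # human-annotated rationale
f1 = token_f1(pred, gold)
```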
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Complexity</title>
          <p>Complexity evaluates how concise and interpretable the explanations are. Sparse explanations that
highlight only a few important features are preferred. We use the following metrics: Complexity
[30] ↓: Measures how evenly importance scores are distributed using Shannon entropy. Higher values
indicate more complex explanations, while lower ones suggest concise attributions. Sparseness [31]
↑: Computes the sparsity of attributions using the Gini index, where higher scores indicate more
concentrated importance on a few features.</p>
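          <p>Both measures can be sketched in a few lines; the entropy and Gini formulas follow the cited definitions in spirit, while the attribution vectors are made-up examples. An attribution that puts all its mass on one token gets minimal entropy and maximal Gini sparseness, and a uniform attribution gets the opposite.</p>

```python
from math import log

def complexity(attributions):
    """Shannon entropy of normalized |attributions|; lower = more concise."""
    mags = [abs(a) for a in attributions]
    total = sum(mags)
    probs = [m / total for m in mags if m > 0]
    return -sum(p * log(p) for p in probs)

def sparseness(attributions):
    """Gini index of sorted |attributions|; higher = mass on fewer features."""
    mags = sorted(abs(a) for a in attributions)
    n, total = len(mags), sum(abs(a) for a in attributions)
    return sum((2 * (i + 1) - n - 1) * m for i, m in enumerate(mags)) / (n * total)

focused = [0.0, 0.0, 1.0, 0.0]      # all importance on a single token
diffuse = [0.25, 0.25, 0.25, 0.25]  # importance spread evenly
```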
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Datasets</title>
        <p>EvalxNLP is a framework for text classification that combines robust evaluation and explanation
capabilities. It supports tasks like Sentiment Analysis, Hate Speech Detection, and Natural Language
Inference (NLI), using rationale-annotated datasets that highlight key text segments. These
human-provided rationales enable assessment of how well model explanations align with human reasoning.
Currently, EvalxNLP includes a representative dataset for each of the aforementioned tasks and supports
these three datasets by default, while also allowing users to extend it with additional classification
datasets. MovieReviews: Designed for Sentiment Analysis, this dataset consists of 1,000 positive
and 1,000 negative movie reviews, each annotated with phrase-level human rationales that justify the
sentiment label. HateXplain [32]: Used for Hate Speech Detection, this dataset comprises 20,000 posts
from Gab and Twitter, annotated with one of three labels: hate speech, offensive, or normal. e-SNLI [33]:
A dataset for Natural Language Inference containing 549,367 examples, split into training, validation,
and test sets. Each example includes a premise and a hypothesis labeled as entailment, contradiction, or
neutral.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Case Study</title>
      <p>To show the usability of EvalxNLP on real-world datasets, we present a case study showcasing how the
tool can be used to benchmark explainers on a sentiment analysis task, using the rationale-annotated
Movie Reviews dataset [29] and an XLM-RoBERTa-base model fine-tuned for sentiment analysis. EvalxNLP
enables users to generate explanations and benchmark explainers either for single instances (local
explanations) or across multiple instances (a subset or an entire dataset). This functionality depends
on the user’s intention, such as identifying the best explanation method with respect to specific metrics
and properties, either for individual sentences or aggregated across datasets. In this case study, we
demonstrate how EvalxNLP benchmarks explainers using the full Movie Reviews dataset by aggregating
evaluation metrics across all instances. Figure 2 presents the results, indicating that DeepLift (DL) achieves
the highest overall faithfulness scores, particularly on the soft metrics, suggesting it produces the most
faithful explanations for this dataset. Integrated Gradients (IG) performs best on the complexity metrics,
indicating its explanations are simpler and easier to understand, while SHAP outperforms the other
methods on the plausibility metrics, showing that its explanations align closely with human intuition.
As expected, no single explanation method excels in all properties, confirming that practitioners must
select methods based primarily on the evaluation property most relevant to their specific use case.</p>
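      <p>Dataset-level benchmarking of this kind reduces to averaging per-instance metric scores per explainer and ranking the results. The sketch below uses made-up soft comprehensiveness scores for three explainers (illustrative numbers, not the actual Figure 2 results):</p>

```python
# Sketch of dataset-level benchmarking: average each metric over all
# instances, per explainer. The scores are made-up illustrative numbers.

def aggregate(per_instance):
    """per_instance: {explainer: [score, ...]} -> {explainer: mean score}."""
    return {name: sum(s) / len(s) for name, s in per_instance.items()}

soft_comp = {                      # soft comprehensiveness, higher = more faithful
    "DeepLift":            [0.61, 0.58, 0.64],
    "IntegratedGradients": [0.50, 0.47, 0.52],
    "SHAP":                [0.55, 0.53, 0.57],
}
means = aggregate(soft_comp)
best = max(means, key=means.get)   # explainer with the best dataset-level score
```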
    </sec>
    <sec id="sec-5">
      <title>5. Human Evaluation</title>
      <p>We also conducted a user-based study involving 20 participants to assess the usability of the tool. The
participants were provided with instructions on how to run the tool and try its different functionalities.
The study first collects demographic data, including participants’ profession, NLP experience,
and prior exposure to benchmarking tools. Participants were then asked to evaluate the system against
a set of criteria on a 5-point Likert scale measuring usability and satisfaction, where 1 and 5 denote
the worst and best values, respectively, for each criterion (for brevity, we do not present the questions
here and refer to Figure 3). Figure 3 presents the results of the human evaluation; these results
were collected prior to the integration of the LLM component, which was later added to enhance the
understanding of explanations generated by the various explainers. Another round of human evaluation
is planned to assess the effectiveness of adding the LLM component.</p>
      <p>[Figure 3: (a) Demographics information; (b) Human evaluation scores]</p>
      <p>As shown in Figure 3b, the overall results are promising, with scores consistently above 3 out
of 5 across all criteria for both participant groups. In particular, the framework is easy to use, and all
users, especially those with greater NLP experience, find the results easy to interpret. However, for the
remaining criteria, participants with less NLP experience provided higher ratings than their
more experienced counterparts. This indicates there remains room for improvement, particularly in
enhancing the framework’s ability to meet the benchmarking needs of more experienced users.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Given the increasing number of available explainability methods and the diverse requirements
stakeholders may have, there is a rising need for continued contributions to existing and new frameworks
that support stakeholders in obtaining and selecting appropriate explanations tailored to their specific
use cases. To address this gap, we introduce EvalxNLP, a novel Python framework designed to
benchmark state-of-the-art Ph-FA explainability methods for transformer-based NLP models, particularly
targeting classification tasks. The framework enables users to generate and evaluate explanations at
the single-instance level and across entire real-world datasets, using different metrics for three main
explainability properties: faithfulness, complexity, and plausibility. EvalxNLP targets various
stakeholders, including laypeople, developers, and researchers, whose goals determine which
properties are most critical. For instance, developers can employ EvalxNLP to debug models and
compare Ph-FA methods, prioritizing the faithfulness metrics relevant to their needs. Our framework
is developed to be easily extensible by the research community. Limitations of the framework include
its focus on classification tasks and on feature-attribution explainability methods. Future directions
include expanding the supported methods and metrics, such as integrating recent non-feature-attribution
techniques like [34] and robustness metrics such as sensitivity [35]. We also plan to incorporate users’
feedback to refine explanation quality, and to extend the framework to generate premise–conclusion
rules from Ph-FA methods to enhance explainability [36].</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We would like to thank the anonymous reviewers for their helpful suggestions. This research has been
supported by the German Federal Ministry of Education and Research (BMBF) grant 01IS23069 Software
Campus 3.0 (TU München).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author has not employed any Generative AI tools.</p>
      <p>[14] A. Shrikumar, P. Greenside, A. Kundaje, Learning important features through propagating activation differences, in: Proc. of ICML, 2017.
[15] J. T. Springenberg, A. Dosovitskiy, T. Brox, M. Riedmiller, Striving for simplicity: The all convolutional net, arXiv preprint arXiv:1412.6806 (2014).
[16] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions of any classifier, in: Proc. of ACM SIGKDD, 2016.
[17] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, NeurIPS (2017).
[18] M. Muschalik, H. Baniecki, F. Fumagalli, P. Kolpaczki, B. Hammer, E. Hüllermeier, shapiq: Shapley interactions for machine learning, NeurIPS (2024).
[19] F. Fumagalli, M. Muschalik, P. Kolpaczki, E. Hüllermeier, B. Hammer, SHAP-IQ: Unified approximation of any-order Shapley interactions, NeurIPS (2023).
[20] M. Muschalik, H. Baniecki, F. Fumagalli, P. Kolpaczki, B. Hammer, E. Hüllermeier, shapiq: Shapley interactions for machine learning, in: NeurIPS Datasets and Benchmarks Track, 2024.
[21] N. Feldhus, L. Hennig, M. D. Nasert, C. Ebert, R. Schwarzenberg, S. Möller, Saliency map verbalization: Comparing feature importance representations from model-free and instruction-based methods, in: Proc. of NLRSE Workshop at ACL, 2023.
[22] O. Zaidan, J. Eisner, Modeling annotators: A generative approach to learning from annotator rationales, in: Proc. of EMNLP, 2008.
[23] F. Barbieri, L. Espinosa Anke, J. Camacho-Collados, XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond, in: Proc. of LREC, 2022.
[24] Z. Zhao, G. Chrysostomou, K. Bontcheva, N. Aletras, On the impact of temporal concept drift on model explanations, in: Findings of ACL: EMNLP, 2022.
[25] Z. Zhao, N. Aletras, Incorporating attribution importance for improving faithfulness metrics, in: Proc. of ACL (Long Papers), 2023.
[26] Z. Zhao, N. Aletras, Incorporating attribution importance for improving faithfulness metrics, in: Proc. of ACL (Long Papers), 2023.
[27] H. Ngai, F. Rudzicz, Doctor XAvIer: Explainable diagnosis on physician-patient dialogues and XAI evaluation, in: Proc. of Workshop on Biomedical Language Processing, 2022.
[28] P. Atanasova, A diagnostic study of explainability techniques for text classification, in: Accountable and Explainable Methods for Complex Reasoning over Text, 2024.
[29] J. DeYoung, S. Jain, N. F. Rajani, E. Lehman, C. Xiong, R. Socher, B. C. Wallace, ERASER: A benchmark to evaluate rationalized NLP models, in: Proc. of ACL, 2020.
[30] U. Bhatt, A. Weller, J. M. F. Moura, Evaluating and aggregating feature-based model explanations, Proc. of IJCAI (2021).
[31] P. Chalasani, J. Chen, A. R. Chowdhury, X. Wu, S. Jha, Concise explanations of neural networks using adversarial training, in: Proc. of ICML, 2020.
[32] B. Mathew, P. Saha, S. M. Yimam, C. Biemann, P. Goyal, A. Mukherjee, HateXplain: A benchmark dataset for explainable hate speech detection, in: Proc. of AAAI, 2021.
[33] O.-M. Camburu, T. Rocktäschel, T. Lukasiewicz, P. Blunsom, e-SNLI: Natural language inference with natural language explanations, NeurIPS (2018).
[34] T. Leemann, A. Fastowski, F. Pfeiffer, G. Kasneci, Attention mechanisms don’t learn additive models: Rethinking feature importance for transformers, TMLR (2025).
[35] C.-K. Yeh, C.-Y. Hsieh, A. Suggala, D. I. Inouye, P. K. Ravikumar, On the (in)fidelity and sensitivity of explanations, in: NeurIPS, 2019.
[36] L. Rizzo, D. Verda, S. Berretta, L. Longo, A novel integration of data-driven rule generation and computational argumentation for enhanced explainable AI, Machine Learning and Knowledge Extraction 6 (2024) 2049–2073.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pawelczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Puri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zitnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lakkaraju</surname>
          </string-name>
          , Openxai:
          <article-title>Towards a transparent evaluation of model explanations</article-title>
          ,
          <source>NeurIPS</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sithakoul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meftah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Feutry</surname>
          </string-name>
          , Beexai:
          <article-title>Benchmark to evaluate explainable ai</article-title>
          , in:
          <source>WC on Explainable AI</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hedström</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Krakowczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bareeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Motzkus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Samek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lapuschkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.-C.</given-names>
            <surname>Höhne</surname>
          </string-name>
          , Quantus:
          <article-title>An explainable ai toolkit for responsible evaluation of neural network explanations and beyond</article-title>
          ,
          <source>JMLR</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sarti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Feldhus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sickert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>van der Wal</surname>
          </string-name>
          ,
          <article-title>Inseq: An interpretability toolkit for sequence generation models</article-title>
          ,
          <source>in: Proc. of ACL (System Demonstrations)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khandagale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Neiswanger</surname>
          </string-name>
          ,
          <article-title>Synthetic benchmarks for scientific research in explainable machine learning</article-title>
          ,
          <source>in: NeurIPS Datasets and Benchmarks Track</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Faber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Moghaddam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wattenhofer</surname>
          </string-name>
          ,
          <article-title>When comparing to ground truth is wrong: On evaluating GNN explanation methods</article-title>
          ,
          <source>in: Proc. of ACM SIGKDD</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Di Bonaventura</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <article-title>Nozza, ferret: a framework for benchmarking explainers on transformers</article-title>
          ,
          <source>in: Proc. of EACL (System Demonstrations)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Miglani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Markosyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garcia-Olano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kokhlikyan</surname>
          </string-name>
          ,
          <article-title>Using Captum to explain generative language models</article-title>
          ,
          <source>in: Proc. of NLP-OSS Workshop at ACL</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Arya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Bellamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dhurandhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Hoffman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Houde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Luss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mojsilović</surname>
          </string-name>
          , et al.,
          <article-title>AI Explainability 360: Impact and design</article-title>
          ,
          <source>in: Proc. of AAAI</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lakkaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <article-title>M4: A unified XAI benchmark for faithfulness evaluation of feature attribution methods across metrics, modalities and models</article-title>
          ,
          <source>NeurIPS</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Deep inside convolutional networks: Visualising image classification models and saliency maps</article-title>
          ,
          <source>arXiv preprint arXiv:1312.6034</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shrikumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Greenside</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kundaje</surname>
          </string-name>
          ,
          <article-title>Learning important features through propagating activation differences</article-title>
          ,
          <source>in: Proc. of ICML</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundararajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Taly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Axiomatic attribution for deep networks</article-title>
          ,
          <source>in: Proc. of ICML</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>