<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Coverage of LLM Trustworthiness Metrics in the Current Tool Landscape</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lennard Helmer</string-name>
          <email>lennard.helmer@iais.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benny Stein</string-name>
          <email>benny.joerg.stein@iais.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Ufer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elanton Fernandes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hammam Abdelwahab</string-name>
          <email>hammam.abdelwahab@iais.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhinav Pareek</string-name>
          <email>abhinav.pareek@iais.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joshua Woll</string-name>
          <email>joshua.woll@iais.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS</institution>
          ,
          <addr-line>St. Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The increasing prevalence of AI systems that are built with Large Language Model (LLM) components creates the need for a dedicated tool stack for monitoring such systems, covering training, development and inference environments. Besides technical performance metrics like latency and throughput, regulations like the EU AI Act require the monitoring of trustworthiness-related metrics like fairness and transparency during operation. In this paper, we describe the results of an investigation we conducted to gain an overview of the current landscape of LLM trustworthiness metrics and their coverage in monitoring tools. Based on an in-depth analysis of available catalogs and additional research, we identified 43 metrics and 23 tools. Furthermore, we highlight existing gaps and potential areas for further research. The results support practitioners and researchers in making informed decisions about the most appropriate tech stack for their AI systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Artificial Intelligence</kwd>
        <kwd>Generative AI</kwd>
        <kwd>Trustworthy AI</kwd>
        <kwd>Responsible AI</kwd>
        <kwd>MLOps</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>the metrics that we identified in Section 4, and the tools in Section 5. Our findings are discussed in
Section 6. Finally, we give a conclusion of our work and an outlook in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>
        Monitoring is an essential aspect of any IT system development. The unique characteristics
and challenges of AI systems, however, additionally call for the monitoring of aspects like trustworthiness.
Unfortunately, the trustworthiness of AI is not an objective state; it depends on several dimensions
that influence each other [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], adding complexity to the oversight of such systems.
      </p>
      <p>
        The currently most widespread paradigm for developing and operating AI systems is Machine
Learning Operations (MLOps), and it emphasizes the importance of monitoring in operation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Pipelines
for the various phases, such as data preprocessing, training and evaluation, as well as monitoring of
these phases, are designed to ensure the quality of the artifacts. Although trustworthiness is not yet
explicitly part of the paradigm, first steps have been taken to enhance it and integrate trustworthy
AI principles into MLOps processes [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A 2024 survey [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] of small and medium-sized enterprises in
Germany on the implementation of MLOps found that while most are aware of its importance, few have
a well-structured approach to monitoring their AI systems. Most rely on “human-based monitoring”
and the ability of users to identify and report AI errors during inference. While this may work in certain
domains and applications, such as machine learning based classification, generative AI makes it much
more difficult to detect incorrect AI outputs.
      </p>
      <p>In the following subsection, we describe the foundation of AI trustworthiness and its dimensions. After
that, we state other related work that we identified during our research, for example targeting aspects
of LLM monitoring during (pre-)training and the unique challenges that LLMs pose for monitoring.</p>
      <sec id="sec-2-1">
        <title>2.1. Trustworthy AI</title>
        <p>
          In addition to a list of metrics, the OECD catalogue also provides a mapping to specific objectives.
Those objectives are Data Governance &amp; Traceability (DGT), Digital Security (DS), Environmental
Sustainability (ES), Explainability (EXP), Fairness (FA), Human Agency &amp; Control (HAC), Performance
(PF), Privacy (PR), Robustness (RN), Safety (SF), and Transparency (TR). Although there is no specific
description of the scope of the individual objectives, they can be understood in a generic sense. The
underlying framework of the catalogue is specified further in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          The objectives described above roughly map to the trustworthiness dimensions known from other
efforts in trustworthy AI research. For example, the AI Assessment Catalogue [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] defines a large
variety of testing tasks, spanning six dimensions of trustworthiness for evaluating AI applications:
Fairness, Autonomy &amp; Control, Transparency, Reliability, Security, and Privacy. It provides a structured
basis for evaluating AI systems, both for developers and assessors.
        </p>
        <p>
          Furthermore, there is ongoing research on LLM-specific trustworthiness dimensions in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and in
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The latter work introduces four AI safety levels (ASL) that relate model scaling
efforts to appropriate safety procedures.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Other Related Work</title>
        <p>
          The rapid progress in language models (LMs) has resulted in the training of increasingly large LMs
on massive quantities of data [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. LLMs are trained on substantial amounts of curated and web-based data,
with web-based sources such as Common Crawl predominating [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Data pipelines for
web-based data handle data preparation, including text extraction, language identification, rigorous filtering,
and deduplication, such as the ones in C4 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], The Pile [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], RefinedWeb [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], CCNet [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], BigScience
ROOTS [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], OSCAR [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], and FineWeb [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. After pre-training, instruction tuning is conducted to
improve the performance of LLMs on several desired tasks [
          <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
          ].
        </p>
        <p>
          Trustworthiness has been considered in some of the research on pre-training pipelines for LLMs.
For example, the Ungoliant pipeline uses a pre-trained perplexity-based KenLM to filter unsafe data
[
          <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
          ]. Furthermore, FineWeb’s pipeline includes filtering of Personally Identifiable Information [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. In
addition, there are several ongoing research efforts on ensuring safety of LLMs by generating instruction
tuning data sets, and using them for training or evaluation. Among them are WildGuard [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] and
DecodingTrust [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. Human evaluation as part of reinforcement learning indirectly contributes to LLM
trustworthiness [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], too.
        </p>
        <p>
          Recent studies [
          <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
          ] have proposed a range of tools to assess different dimensions of trustworthiness.
For example, tools that focus on assessing fairness [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] often use bias detection metrics to assess
disparities across demographic groups. While there are studies [
          <xref ref-type="bibr" rid="ref28 ref8">8, 28</xref>
          ] on the subject of trustworthiness
of LLMs, to the best of our knowledge there are no papers yet that systematically investigate the
coverage of LLM trustworthiness metrics in software tools.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In the following, we describe the general methodology that we applied to identify the metrics and tools
for LLM monitoring.</p>
      <p>Our starting point is the OECD catalog for trustworthy AI tools and metrics, which curates an up-to-date
list of trustworthiness metrics and tools, based on feedback from the community and on automated
searches in GitHub. We filtered for metrics that were tagged with the purpose “Content generation”
and the life cycle stage “Operate &amp; monitor”. For tools, we used the tags “technical” and “operational &amp;
monitoring”.</p>
      <p>Additionally, we performed literature searches for metrics as well as tools, based on search queries that
we formulated using keywords commonly discussed in trustworthy AI literature (e.g., trustworthiness).
Each query included terms related to the attribute of interest (e.g., fairness, safety) combined with terms
representing measurement and evaluation (e.g., metric, assessment, framework, tool). Additionally,
each query targeted literature specifically mentioning “large language model”, “LLM”, or “generative
AI” to ensure relevance to this investigation.</p>
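      <p>The query construction described above can be illustrated with a minimal sketch. The term lists only repeat the examples named in the text, and the Boolean AND/OR syntax and the function name build_queries are assumptions made for illustration; the exact query strings used in the search are not given here.</p>
      <preformat>
```python
from itertools import product

# Example terms taken from the description above; the full lists are not given.
attributes = ["fairness", "safety"]
measurements = ["metric", "assessment", "framework", "tool"]
llm_terms = ['"large language model"', "LLM", '"generative AI"']


def build_queries(attributes, measurements, llm_terms):
    """Combine each attribute with each measurement term and restrict to LLM literature.

    The AND/OR syntax is an assumption; search engines differ in their query languages.
    """
    scope = " OR ".join(llm_terms)
    return [
        f"{attribute} AND {measurement} AND ({scope})"
        for attribute, measurement in product(attributes, measurements)
    ]


queries = build_queries(attributes, measurements, llm_terms)
print(len(queries))  # 8
print(queries[0])
```
      </preformat>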
      <p>The relevance of each publication was assessed based on the number of citations and the judgment of the
authors of this paper. A notable difference for literature related to tools is that the citation count
is less relevant: it is a reasonable objective measure of the importance of an academic publication,
but not a meaningful indicator of the prevalence of a software tool. We again took the number of citations
into account if a matching publication for a tool was available, but also considered GitHub stars, which the
developer community uses to mark noteworthy repositories. In addition, we drew on the
authors’ experience from multiple industry projects that dealt with the guard-railing of LLMs. Each
tool was analyzed using publicly available websites, the tool’s documentation (if available), and a code
analysis if the source code was published. The available information was analyzed for mentions of
the metrics that we had previously identified. While many tools appear promising, few of them
publicly state which metrics are available. In particular, proprietary tools do not reveal many details
about available metrics. Thus, they were usually not considered.</p>
      <p>Our list of judge-based metrics was compiled through a literature search on LLM-as-a-judge topics
and by additionally scanning the documentation of the most well-known software tools for LLM testing
that were identified in the search process for the tools.</p>
      <p>The search process was conducted between the 14th and 17th of April, 2025.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Metrics</title>
      <p>In this section, we discuss the available metrics for the trustworthiness monitoring of LLMs. We start
by presenting and describing our complete collection in Section 4.1, and after that give details about the
judge-based metrics in Section 4.2.</p>
      <sec id="sec-4-1">
        <title>4.1. Description of the Final Collection</title>
        <p>In the following, we provide an overview of the metrics identified through the process described in
Section 3, categorized by their respective objective, and summarize the results in Table 1<sup>2</sup>. The prefix
O of an ID in the table indicates that the metric was found in the OECD catalog, while the prefix A
indicates that the metric was added based on additional research. The prefix J shows that the metric is
based on an LLM-as-a-judge approach. It should be noted that some of the metrics from the OECD
catalog also rely on a judge approach.</p>
        <p>In total, we analyzed 59 metrics and consider 43 of those to be both related to the trustworthiness
of LLMs and to be used in monitoring. We excluded metrics that appeared in our research, but where
the results did not reveal their suitability for either LLMs or monitoring purposes. To gain a broader
perspective, we also searched gray literature such as blog entries and articles for approaches to use
the identified metrics in LLM monitoring before deciding whether or not to include a metric in our
collection. Of these 43 metrics, 17 are listed in the OECD metrics catalog, 7 were identified during the
literature review, and 19 were identified through dedicated research on LLM-as-a-judge approaches.
Metrics of this type are discussed in more detail in Section 4.2.</p>
        <p>To better understand the distribution of the metrics across the trustworthiness objectives mentioned
in Section 2.1, we classified the objective for each of the additional metrics that were identified. We
made the classification to the best of our knowledge and based on the documentation available for each
metric. For the metrics from the OECD catalog, we used the classification from the catalog itself.</p>
        <p>We found that none of the metrics are suitable for the objectives of Data Governance &amp; Traceability,
Digital Security and Environmental Sustainability, so these three objectives are not listed in Table 1 at all.</p>
        <p>About two thirds of the metrics (28) can be used to measure the Robustness of an LLM-based system, which
is not surprising, as most trustworthiness dimensions aim to ensure that the AI system behaves in
an expected manner. It also reflects the fact that hallucinations are one of the most studied shortcomings
of LLMs, and the Robustness objective addresses them. Performance is the second largest group,
with 20 metrics that could be used to provide insight. 14 metrics are suitable for measuring
Safety-related issues, which is probably related to the widespread awareness of issues such as prompt injection.
Explainability (10), Human Agency &amp; Control (9), Fairness (8) and Transparency (7) are close in terms of
the number of suitable metrics. Privacy is clearly the smallest group of related metrics (4).</p>
        <p>Figure 1 illustrates the distribution of metrics across the objectives. The blue bars represent the total
number of available metrics for each objective, while the green bars indicate the subset of metrics that
are currently addressed by existing tools. This visualization does not contain any information
beyond that in Table 1, but illustrates the uneven development of assessment capabilities
across trustworthy AI objectives and highlights areas requiring enhanced tooling support. It should be
noted that the figure can only serve as a general orientation, since all metrics are given the same weight
in this visualization although they are not equally valuable.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. On Judge-based Metrics</title>
        <p>
          Judge-based metrics rely on a recently established method to improve trustworthiness of LLMs. They are
constructed from a “judge” LLM that is given both an instruction for and an answer from the LLM under
examination, and a second instruction for the judge itself. The judge scores the answer based on these
inputs and context (if available), with the specific scoring procedure varying across different metrics.
We refer to the original work [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] for details. Judge-based metrics are typically use-case agnostic, highly
configurable and also very efective, but they fall short in terms of consistency, reproducibility and are
very sensitive to their configuration, such as the prompts that are being used for the judge LLMs. It
should be noted that is still unclear how reliable these are in general [
          <xref ref-type="bibr" rid="ref38 ref39 ref40 ref41">38, 39, 40, 41</xref>
          ], but we decided
that they should be included in the picture, as they are widely adopted and promising for designing
application-specific metrics quite easily.
        </p>
        <p><sup>2</sup> For comprehensive descriptions of all of the identified metrics, the reader is referred to the sources in the last column of
Table 1. It contains several links to overview pages of specific tools; the individual links to the metrics are not provided in order
not to clutter the table, but the reader should be able to navigate to them. We also did not include additional sources for the
metrics from the OECD catalog.</p>
        <p>Sources and notes for Table 1:</p>
        <p>(a) https://oecd.ai/en/catalogue/metrics</p>
        <p>(b) https://www.deepeval.com/docs/metrics-introduction</p>
        <p>(c) https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/</p>
        <p>(d) https://mlflow.org/docs/latest/llms/llm-evaluate/#llm-evaluation-metrics</p>
        <p>(e) https://www.confident-ai.com/blog/how-i-built-deterministic-llm-evaluation-metrics-for-deepeval</p>
        <p>(f) https://docs.giskard.ai/en/stable/knowledge/catalogs/test-catalog/text_generation/index.html</p>
        <p>(g) Llama Guard is a special case of the Moderation metric with a specific judge model and discrete output.</p>
        <p>(h) https://www.comet.com/docs/opik/evaluation/metrics/moderation</p>
        <p>[Figure 1: Distribution of metrics across the objectives. Legend: metric counter; metric counter (covered by tools).]</p>
        <p>
          In the following, we describe the judge-based metrics listed in Table 1. A very general metric that
can be defined only by some evaluation criteria in natural language is G-Eval [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]. It relies on the
interpretation of the criteria by the judge model, and thus is very judge-dependent. Initially, G-Eval
was only using GPT-3.5 and GPT-4 for evaluation. With Prometheus [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ], an effort was made to provide
an open-source alternative to the GPT models. A more recent and more deterministic approach
that uses decision trees for evaluation is given by DAG (Deep Acyclic Graph)<sup>3</sup>.
        </p>
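        <p>To make the general judge setup more concrete, the following minimal sketch scores an answer against criteria given in natural language. It is an illustration under stated assumptions and not the implementation of G-Eval or of any tool: the prompt template, the 1 to 5 rating scale, the function judge_score and the stand-in stub_judge (which replaces a real judge LLM call) are all hypothetical.</p>
        <preformat>
```python
import re
from typing import Callable, Optional

# Hypothetical prompt template; real tools use their own, more elaborate prompts.
JUDGE_TEMPLATE = (
    "You are an evaluator.\n"
    "Criteria: {criteria}\n"
    "Instruction given to the model: {instruction}\n"
    "Answer of the model: {answer}\n"
    "Context: {context}\n"
    "Reply with a single integer score from 1 to 5."
)


def judge_score(
    judge: Callable[[str], str],
    criteria: str,
    instruction: str,
    answer: str,
    context: Optional[str] = None,
) -> float:
    """Ask the judge for a 1-5 rating and map it to the interval [0, 1]."""
    prompt = JUDGE_TEMPLATE.format(
        criteria=criteria,
        instruction=instruction,
        answer=answer,
        context=context or "(none)",
    )
    reply = judge(prompt)
    # Take the first digit between 1 and 5 in the reply as the rating.
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Judge reply contains no score: {reply!r}")
    return (int(match.group()) - 1) / 4


def stub_judge(prompt: str) -> str:
    """Stand-in for a real judge LLM call; always returns the same rating."""
    return "Score: 4"


score = judge_score(
    stub_judge,
    criteria="The answer is polite and free of insults.",
    instruction="Summarize the customer complaint.",
    answer="The customer reports a late delivery.",
)
print(score)  # 0.75
```
        </preformat>
        <p>Because the score depends entirely on the reply of the judge, replacing the stub with a different model or changing the template can change the result, which illustrates the sensitivity to configuration discussed above.</p>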
        <p>
          Some of the judge-based metrics are actually based on a more general method called Question Answer
Generation (QAG), introduced in the context of summarization in [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ]. Here, a (judge) LLM is first used to
extract all claims in an LLM’s response, and then the number of correct claims (identified again
by the judge, usually using context or even ground truth, if available) is turned into a score or ratio.
In our list, OM03, OM04, OM06, OM07, JM09, JM10 and JM11 are defined via such a procedure. With
the increasing adoption of Agentic AI and the central role that LLMs play in such architectures, novel
metrics are constructed that are either suitable for step-by-step evaluation of Agentic AI applications
(similar to what unit tests do in software architectures), or for trajectory evaluation, ensuring that
the decision-making process of an agent is using the appropriate tools, inputs, and memory in the
appropriate spot and in the right order. Metrics OM09 and JM12 address some of these Agent-specific
aspects. We refer to [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ] for a more complete picture on the topic of observability in Agentic AI.
        </p>
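        <p>The QAG procedure described above can be sketched as follows. This is a simplified illustration, not the definition of any of the listed metrics: the function qag_score is hypothetical, and the two helper functions are naive stand-ins for what would be judge LLM calls in practice (claim extraction by sentence splitting, verification by substring matching).</p>
        <preformat>
```python
from typing import Callable, List


def qag_score(
    response: str,
    context: str,
    extract_claims: Callable[[str], List[str]],
    verify_claim: Callable[[str, str], bool],
) -> float:
    """Return the fraction of claims in the response that are accepted given the context."""
    claims = extract_claims(response)
    if not claims:
        # Assumption: a response without any claims is not penalized.
        return 1.0
    supported = sum(1 for claim in claims if verify_claim(claim, context))
    return supported / len(claims)


def split_sentences(text: str) -> List[str]:
    """Stand-in for claim extraction by a judge LLM: one claim per sentence."""
    return [part.strip() for part in text.split(".") if part.strip()]


def appears_in_context(claim: str, context: str) -> bool:
    """Stand-in for claim verification by a judge LLM: literal substring match."""
    return claim.lower() in context.lower()


context = "Bonn is a city in Germany. It lies on the Rhine."
response = "Bonn is a city in Germany. It lies on the Elbe."
qag_result = qag_score(response, context, split_sentences, appears_in_context)
print(qag_result)  # 0.5
```
        </preformat>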
        <p>
          Further metrics that are not listed here are BERTScore [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ], MOVERScore [
          <xref ref-type="bibr" rid="ref46">46</xref>
          ] (both embedding-based),
BARTScore [
          <xref ref-type="bibr" rid="ref47">47</xref>
          ], UniEval [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ], and metrics that combine statistical and model-based approaches,
such as GPTScore [
          <xref ref-type="bibr" rid="ref49">49</xref>
          ] and SelfCheckGPT [
          <xref ref-type="bibr" rid="ref50">50</xref>
          ]. We do not discuss them further here, as it
was shown in [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] that they are consistently outperformed by G-Eval (which is not surprising keeping
in mind how they are constructed in comparison to G-Eval).
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Tools</title>
      <p>In this section, we discuss the available tools for LLM trustworthiness monitoring and show which
of the metrics from the previous section they cover. We stress that a thorough comparison or holistic
benchmark of the identified tools is beyond the scope of this work.</p>
      <p><sup>3</sup> https://www.confident-ai.com/blog/how-i-built-deterministic-llm-evaluation-metrics-for-deepeval</p>
      <p>Although the OECD catalog itself consists of more than 900 entries for tools, the filtering by the tags
“technical” and “operational &amp; monitoring” leaves only 95 tools for detailed analysis. After this analysis
and the inclusion of additional tools described in detail in Section 3, we obtained 23 tools in total that
we consider to be suitable for the research goals of this paper. The complete list of tools is shown in
Table 2, along with the metrics that were identified.</p>
      <p>In Table 2, the prefix O of an ID again indicates that the tool was found in the OECD
catalog, while the prefix A indicates that the tool was added based on additional research. Some metrics
are available as a tool by themselves (e.g. LIME), so they appear on both lists: metrics and tools.</p>
      <p>Sources for Table 2:</p>
      <p>(a) https://github.com/Mindgard/cli</p>
      <p>(b) https://boschaishield.com/</p>
      <p>(c) https://www.quantpi.com/</p>
      <p>(d) https://github.com/Giskard-AI/giskard</p>
      <p>(e) https://github.com/Trusted-AI/adversarial-robustness-toolbox</p>
      <p>(f) https://aws.amazon.com/de/sagemaker-ai/clarify/</p>
      <p>(g) https://arize.com/docs/phoenix</p>
      <p>(h) https://h2o.ai/platform/enterprise-h2ogpte/model-validation/</p>
      <p>(i) https://mlflow.org/docs/3.4.0/genai/</p>
      <p>(j) https://shap.readthedocs.io/en/latest/index.html</p>
      <p>(k) https://www.comet.com/docs/opik/</p>
      <p>(l) https://docs.evidentlyai.com</p>
      <p>The coverage of the metrics by the tools is shown in Figure 2 and discussed in the following.</p>
      <p>The first result is that several tools are available that can be used for LLM trustworthiness
monitoring, and that each of them covers at least one of the metrics that we identified. Conversely, we
observe that 11 metrics are not yet available in any of the tools that we identified and thus may not be
easy to apply. None of the tools covers all, or even nearly all, of the identified metrics, so practitioners will
likely have to rely on several tools to cover a broad range of trustworthiness aspects in their application,
at least in the near future.</p>
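      <p>A coverage analysis of this kind can be reproduced with a few lines of code once a mapping from tools to metrics is available. The sketch below is illustrative only: the function coverage_report, the tool names and the metric assignments are made up for the example and do not reproduce the content of Table 1 or Table 2.</p>
      <preformat>
```python
from typing import Dict, List, Set, Tuple


def coverage_report(
    metrics: List[str], tools: Dict[str, Set[str]]
) -> Tuple[Dict[str, int], List[str]]:
    """Count how many tools cover each metric and list the metrics covered by no tool."""
    counts = {metric: 0 for metric in metrics}
    for covered in tools.values():
        for metric in covered:
            if metric in counts:
                counts[metric] += 1
    uncovered = [metric for metric in metrics if counts[metric] == 0]
    return counts, uncovered


# Illustrative data only; not the actual content of Table 1 or Table 2.
metrics = ["JM16", "JM08", "OM06", "AM01"]
tools = {
    "tool_a": {"JM16", "JM08"},
    "tool_b": {"JM16", "OM06"},
    "tool_c": {"JM16"},
}

counts, uncovered = coverage_report(metrics, tools)
print(counts)     # {'JM16': 3, 'JM08': 1, 'OM06': 1, 'AM01': 0}
print(uncovered)  # ['AM01']
```
      </preformat>
      <p>Selecting a set of tools that together cover a desired list of metrics then amounts to checking that the list of uncovered metrics is empty for that set.</p>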
      <p>Figure 2 also shows directly that judge-based metrics are covered much more
often than the others. Specifically, the most frequently implemented metric is Toxicity (JM16), followed by
Prompt Injection (JM08), Information Disclosure Detector (JM19), Response/Answer Relevancy (OM06)
and Faithfulness (OM07). All of these metrics rely on a judge approach.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>Metric gaps and limitations. The fact that only four metrics are available with the objective of
Privacy highlights a significant gap in current research, which is further confirmed by the generic nature
of two of these metrics. The underrepresentation of privacy objectives is concerning, particularly given
that the identified privacy metrics are judge-based and not yet included in the OECD catalog. Moreover,
the OECD catalog lacks Transparency metrics and only includes two Fairness metrics, emphasizing the
importance of incorporating additional metrics beyond the catalog.</p>
      <p>Judge-based metrics, however, exhibit a notable lack of coverage in the objectives of Explainability
and Transparency, likely due to the fact that judges only work with the given instructions and context and
are not able to explain or reveal anything beyond that (except for what they “saw” during training).</p>
      <p>The metrics presented in this work are subject to certain limitations. For instance, some metrics
are task-specific (e.g., ROUGE for summarization or CLEVER score and LIME for classification), while
others are scoped for other reasons such as relying on benchmarks or datasets that may not accurately
reflect real-world deployment scenarios. Most metrics also inherently depend on randomization, such
as training-test splits. Additionally, some metrics may be costly in terms of LLM usage, either due to
large data set processing requirements or the need for GPU-intensive model invocations, as for the
LLMs used in the judge-based metrics.</p>
      <p>Tool limitations. Concerning the tools, it is noteworthy that despite the OECD catalog’s extensive
list, many tools are not technical or suitable for monitoring, resulting in a relatively small number of
relevant tools for our purpose.</p>
      <p>We observed that the coverage of metrics by available tools is limited, with only a few metrics being
well-represented. This suggests a lag in the translation of research into usable software, potentially due
to tool developers’ focus on certain aspects of trustworthiness, such as prompt injection, at the expense
of less prominent issues.</p>
      <p>Our emphasis on using only publicly available information and avoiding black boxes for transparency
reasons may have implications for the coverage of proprietary tools. They are not yet covered to an
acceptable degree in our work due to restricted documentation, despite the fact that some proprietary
tools have accessible documentation about the available metrics before a paywall. However, open-source
software tools, which comprise the majority of our list, are expected to have much better coverage.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this work, we have provided an overview of the current state of trustworthiness monitoring, including
collections of available metrics and tools, to serve as a guideline for researchers and practitioners. Our
compilation highlights the trustworthiness objectives and corresponding metrics, as well as the tools
that assess them, aiming to identify areas with insufficient coverage for researchers and facilitate the
improvement of trustworthiness in specific aspects. Practitioners can use the compilations from Table 1
and Table 2 as a reference.</p>
      <p>
        As the landscape of available tools continues to evolve rapidly, their capabilities are constantly
improving, too. Additionally, the use of AI agents, which can themselves be seen as a means
to more responsible AI [62], will expand the monitoring component to include further aspects, in
particular in terms of safety and tracing. The recently proposed paradigm of AgentOps [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ] addresses
this aspect, and will very likely be followed by others soon. Due to this very dynamic situation, the
snapshot described here is likely to be outdated soon.
      </p>
      <p>What is missing is a general framework that enables easy selection of applicable metrics for specific
settings or applications. So far, it is hard to even find (not to mention implement) a set of metrics that
yields an appropriate coverage of all trustworthiness aspects in the specific setting. A possible next
step is the creation of an adaptive framework that can dynamically adjust to evolving trustworthiness
requirements and incorporate new metrics and tools as they become available. This calls for
standardized tooling and interoperability protocols to enable seamless integration and comparison of
different trustworthiness monitoring solutions. Thus, we encourage researchers and practitioners to
collaborate on the development of standardized, adaptive, and interoperable trustworthiness monitoring
frameworks.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work has been funded by the Fraunhofer Cluster of Excellence Cognitive Internet Technologies
and the German Federal Ministry for Economic Affairs and Climate Action (BMWK) through the project
OpenGPT-X (project no. 68GX21007D).</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used DeepL and Claude in order to: Grammar, spelling
check and producing code for generating the visuals. Further, one of the authors used Llama 3 for
improving the wording of individual paragraphs. After using these tools, the authors reviewed and
edited the content as needed and take full responsibility for the publication’s content.
ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems,
2024, pp. 203–213.
[54] L. Weiner, Troj.ai detect &amp; defend, https://www.troj.ai/products/, 2025. Accessed: 2025-07-26.
[55] M. T. Ribeiro, S. Singh, C. Guestrin, "Why Should I Trust You?": Explaining the Predictions of Any
Classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, 2016, pp. 1135–1144.
[56] R. K. E. Bellamy, K. Dey, M. Hind, S. C. Hofman, S. Houde, K. Kannan, P. Lohia, J. Martino,
S. Mehta, A. Mojsilovic, S. Nagar, K. N. Ramamurthy, J. Richards, D. Saha, P. Sattigeri, M. Singh,
K. R. Varshney, Y. Zhang, AI Fairness 360: An Extensible Toolkit for Detecting, Understanding,
and Mitigating Unwanted Algorithmic Bias, 2018. URL: https://arxiv.org/abs/1810.01943.
[57] S. Kumar, R. Shokri, ML Privacy Meter: Aiding regulatory compliance by quantifying the privacy
risks of machine learning, in: Workshop on Hot Topics in Privacy Enhancing Technologies
(HotPETs), 2020.
[58] I. Webster, M. D’Angelo, S. Klein, G. Zang, promptfoo, 2025. URL:
https://github.com/promptfoo/promptfoo.
[59] ExplodingGradients, Ragas: Supercharge your llm application evaluations,
https://github.com/explodinggradients/ragas, 2024.
[60] J. Ip, K. Vongthongsri, deepeval, 2025. URL: https://github.com/confident-ai/deepeval.
[61] V. Arya, R. K. E. Bellamy, P.-Y. Chen, A. Dhurandhar, M. Hind, S. C. Hoffman, S. Houde, Q. V.</p>
      <p>Liao, R. Luss, A. Mojsilović, S. Mourad, P. Pedemonte, R. Raghavendra, J. Richards, P. Sattigeri,
K. Shanmugam, M. Singh, K. R. Varshney, D. Wei, Y. Zhang, One explanation does not fit all: A
toolkit and taxonomy of ai explainability techniques, 2019. URL: https://arxiv.org/abs/1909.03012.
[62] Q. Lu, L. Zhu, X. Xu, Z. Xing, S. Harrer, J. Whittle, Towards Responsible Generative AI: A Reference
Architecture for Designing Foundation Model Based Agents, in: 2024 IEEE 21st International
Conference on Software Architecture Companion (ICSA-C), 2024, pp. 119–126.
doi:10.1109/ICSA-C63560.2024.00028.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <article-title>Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2306.05685. doi:10.48550/arXiv.2306.05685. arXiv:2306.05685.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schleiger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Douglas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kuhnert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Resolving ethics trade-offs in implementing responsible AI</article-title>
          ,
          <source>arXiv preprint arXiv:2401.08103</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kreuzberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kühl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hirschl</surname>
          </string-name>
          ,
          <article-title>Machine Learning Operations (MLOps): Overview, Definition, and Architecture</article-title>
          ,
          <source>IEEE Access 11</source>
          (
          <year>2023</year>
          )
          <fpage>31866</fpage>
          -
          <lpage>31879</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Helmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Martens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wegener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Akila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abbas</surname>
          </string-name>
          ,
          <article-title>Towards Trustworthy AI Engineering - A Case Study on integrating an AI audit catalog into MLOps processes</article-title>
          ,
          <source>in: Proceedings of the 2nd International Workshop on Responsible AI Engineering</source>
          , RAIE '24, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . URL: https://doi.org/10.1145/3643691.3648584. doi:10.1145/3643691.3648584.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Helmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kerbel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Martens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Temath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wegener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zorn</surname>
          </string-name>
          ,
          <article-title>Machine Learning Operations (MLOps): Grundlagen, Chancen und Herausforderungen beim MLOps-Einsatz in Unternehmen</article-title>
          ,
          <year>2024</year>
          . URL: https://doi.org/10.24406/publica-2962.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>OECD</surname>
          </string-name>
          ,
          <article-title>Tools for trustworthy AI: A framework to compare implementation tools for trustworthy AI systems</article-title>
          , OECD Digital Economy Papers No.
          <volume>312</volume>
          (
          <year>2021</year>
          ). doi:10.1787/008232ec-en.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Poretschkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schmitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Akila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Adilova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Cremers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hecker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Houben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rosenzweig</surname>
          </string-name>
          , et al.,
          <source>Guideline for Trustworthy Artificial Intelligence-AI Assessment Catalog</source>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2307.03681. doi:10.48550/arXiv.2307.03681. arXiv:2307.03681.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>TrustLLM: Trustworthiness in Large Language Models</article-title>
          ,
          <source>arXiv preprint arXiv:2401.05561</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2401.05561.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Anthropic</surname>
          </string-name>
          ,
          <article-title>Anthropic's responsible scaling policy</article-title>
          ,
          <year>2023</year>
          . URL: https://www-cdn.anthropic.com/1adf000c8f675958c2ee23805d91aaade1cd4613/responsible-scaling-policy.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Scaling laws for neural language models</article-title>
          ,
          <source>arXiv preprint arXiv:2001.08361</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brandizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Abdelwahab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhowmick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Helmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denisov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fromm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rutmann</surname>
          </string-name>
          , et al.,
          <article-title>Data Processing for the OpenGPT-X Model Family</article-title>
          ,
          <source>arXiv preprint arXiv:2410.08800</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>Journal of Machine Learning Research 21</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Golding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hoppe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Phang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nabeshima</surname>
          </string-name>
          , et al.,
          <article-title>The Pile: An 800GB dataset of diverse text for language modeling</article-title>
          ,
          <source>arXiv preprint arXiv:2101.00027</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Penedo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Malartic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hesslow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cojocaru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cappelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Alobeidli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pannier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Almazrouei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Launay</surname>
          </string-name>
          ,
          <article-title>The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only</article-title>
          ,
          <source>arXiv preprint arXiv:2306.01116</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <article-title>CCNet: Extracting high quality monolingual datasets from web crawl data</article-title>
          ,
          <source>arXiv preprint arXiv:1911.00359</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ippolito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nystrom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Carlini</surname>
          </string-name>
          ,
          <article-title>Deduplicating training data makes language models better</article-title>
          ,
          <source>arXiv preprint arXiv:2107.06499</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Abadji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. O.</given-names>
            <surname>Suarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Romary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          ,
          <article-title>Towards a cleaner document-oriented multilingual crawled corpus</article-title>
          ,
          <source>arXiv preprint arXiv:2201.06642</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Penedo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kydlíček</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lozhkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Von Werra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          , et al.,
          <article-title>The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale</article-title>
          ,
          <source>arXiv preprint arXiv:2406.17557</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <article-title>Instruction Tuning for Large Language Models: A Survey</article-title>
          ,
          <source>arXiv preprint arXiv:2308.10792</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Webson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          , et al.,
          <article-title>The Flan Collection: Designing Data and Methods for Effective Instruction Tuning</article-title>
          , in: International Conference on Machine Learning, PMLR,
          <year>2023</year>
          , pp.
          <fpage>22631</fpage>
          -
          <lpage>22648</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Abadji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J. O.</given-names>
            <surname>Suárez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Romary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          ,
          <article-title>Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus</article-title>
          ,
          <source>in: CMLC 2021-9th Workshop on Challenges in the Management of Large Corpora</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zevallos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. O.</given-names>
            <surname>Suarez</surname>
          </string-name>
          ,
          <article-title>Perplexed by quality: A perplexity-based method for adult and harmful content detection in multilingual heterogeneous web data</article-title>
          ,
          <source>arXiv preprint arXiv:2212.10440</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ettinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lambert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dziri</surname>
          </string-name>
          ,
          <article-title>WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs</article-title>
          ,
          <source>arXiv preprint arXiv:2406.18495</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dutta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schaeffer</surname>
          </string-name>
          , et al.,
          <article-title>DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models</article-title>
          ,
          <source>in: NeurIPS</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dudík</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Edgar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lutz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Milan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sameki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <article-title>Fairlearn: A toolkit for assessing and improving fairness in AI</article-title>
          ,
          <source>Microsoft, Tech. Rep. MSR-TR-2020-32</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-K. R.</given-names>
            <surname>Choo</surname>
          </string-name>
          ,
          <article-title>Faircompass: Operationalising fairness in machine learning</article-title>
          ,
          <year>2023</year>
          . doi:10.1109/TAI.2023.3348429.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mehrabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Morstatter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Galstyan</surname>
          </string-name>
          ,
          <article-title>A survey on bias and fairness in machine learning</article-title>
          ,
          <source>ACM Comput. Surv.</source>
          <volume>54</volume>
          (
          <year>2021</year>
          ). URL: https://doi.org/10.1145/3457607. doi:10.1145/3457607.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Ton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guo</surname>
          </string-name>
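          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Klochkov</surname>
          </string-name>
          ,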
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Taufiq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2308.05374. arXiv:2308.05374.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dathathri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Molino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Plug and Play Language Models: A Simple Approach to Controlled Text Generation</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/1912.02164. doi:10.48550/arXiv.1912.02164. arXiv:1912.02164.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gavin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          , et al.,
          <article-title>Lime: Less is more for mllm evaluation</article-title>
          ,
          <source>arXiv preprint arXiv:2409.06851</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>A unified approach to interpreting model predictions</article-title>
          ,
          <source>in: Proceedings of the 31st International Conference on Neural Information Processing Systems</source>
          , NIPS'17, Curran Associates Inc., Red Hook, NY, USA,
          <year>2017</year>
          , pp.
          <fpage>4768</fpage>
          -
          <lpage>4777</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Binder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Montavon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Klauschen</surname>
          </string-name>
          ,
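          <string-name>
            <given-names>K.-R.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Samek</surname>
          </string-name>
          ,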
          <article-title>On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation</article-title>
          ,
          <source>PLOS ONE</source>
          <volume>10</volume>
          (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>46</lpage>
          . doi:10.1371/journal.pone.0130140.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A Package for Automatic Evaluation of Summaries</article-title>
          ,
          <source>in: Text Summarization Branches Out, Association for Computational Linguistics</source>
          , Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013/.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>BLEU: a Method for Automatic Evaluation of Machine Translation</article-title>
          , in: P. Isabelle, E. Charniak, D. Lin (Eds.),
          <source>Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</source>
          , Philadelphia, Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . URL: https://aclanthology.org/P02-1040/. doi:10.3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>T.-W.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Daniel</surname>
          </string-name>
          ,
          <article-title>Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach</article-title>
          ,
          <year>2018</year>
          . URL: https://arxiv.org/abs/1801.10578. doi:10.48550/arXiv.1801.10578. arXiv:1801.10578.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2303.16634. doi:10.48550/arXiv.2303.16634. arXiv:2303.16634.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>H.</given-names>
            <surname>Inan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Upasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rungta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tontchev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Testuggine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khabsa</surname>
          </string-name>
          ,
          <article-title>Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2312.06674. doi:10.48550/arXiv.2312.06674. arXiv:2312.06674.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Zamfirescu-Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Arawjo</surname>
          </string-name>
          ,
          <article-title>Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2404.12272. doi:10.48550/arXiv.2404.12272. arXiv:2404.12272.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chern</surname>
          </string-name>
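          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2401.16788. doi:10.48550/arXiv.2401.16788. arXiv:2401.16788.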
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2403.02839. doi:10.48550/arXiv.2403.02839. arXiv:2403.02839.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Ramayapally</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vaidyanathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hupkes</surname>
          </string-name>
          ,
          <article-title>Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2406.12624. doi:10.48550/arXiv.2406.12624. arXiv:2406.12624.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seo</surname>
          </string-name>
          ,
          <article-title>Prometheus: Inducing Fine-grained Evaluation Capability in Language Models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2310.08491. doi:10.48550/arXiv.2310.08491. arXiv:2310.08491.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <article-title>Asking and Answering Questions to Evaluate the Factual Consistency of Summaries</article-title>
          , in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>5008</fpage>
          -
          <lpage>5020</lpage>
          . URL: https://aclanthology.org/2020.acl-main.450/. doi:10.18653/v1/2020.acl-main.450.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>AgentOps: Enabling Observability of LLM Agents</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2411.05285. arXiv:2411.05285.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>BERTScore: Evaluating Text Generation with BERT</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/1904.09675. doi:10.48550/arXiv.1904.09675. arXiv:1904.09675.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Peyrard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Meyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Eger</surname>
          </string-name>
          ,
          <article-title>MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1909.02622. doi:10.48550/arXiv.1909.02622. arXiv:1909.02622.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
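          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>BARTScore: Evaluating Generated Text as Text Generation</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2106.11520. doi:10.48550/arXiv.2106.11520. arXiv:2106.11520.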
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiao</surname>
          </string-name>
          , P. Liu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          , J. Han,
          <article-title>Towards a Unified Multi-Dimensional Evaluator for Text Generation</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2210.07197. doi:10.48550/arXiv.2210.07197. arXiv:2210.07197.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-K.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>GPTScore: Evaluate as You Desire</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2302.04166. doi:10.48550/arXiv.2302.04166. arXiv:2302.04166.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>P.</given-names>
            <surname>Manakul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liusie</surname>
          </string-name>
          , M. Gales,
          <article-title>SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models</article-title>
          , in: H. Bouamor, J. Pino, K. Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>9004</fpage>
          -
          <lpage>9017</lpage>
          . URL: https://aclanthology.org/2023.emnlp-main.557/. doi:10.18653/v1/2023.emnlp-main.557.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <collab>AINTrust</collab>
          , Aixploit,
          <year>2025</year>
          . URL: https://github.com/AINTRUST-AI/aixploit, commit: 9d4c59d2f090b23dd44bbf4df91ae8f1a76f0d20.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>L.</given-names>
            <surname>Derczynski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Galinkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Inie</surname>
          </string-name>
          ,
          <article-title>garak: A Framework for Security Probing Large Language Models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2406.11036. doi:10.48550/arXiv.2406.11036. arXiv:2406.11036.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>S.</given-names>
            <surname>Morales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Clarisó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cabot</surname>
          </string-name>
          ,
          <article-title>A DSL for testing LLMs for fairness and bias</article-title>
          , in: Proceedings of the
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>