<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Metrics: an Investigation into the Reliability of Evaluation Metrics for Domain Specific Graph-based Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lia Draetta</string-name>
          <email>lia.draetta@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Antonio Stranisci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Flaviana Corallo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pier Felice Balestrucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Oliverio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rossana Damiano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Mazzei</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Turin</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>1</fpage>
      <lpage>8</lpage>
      <abstract>
        <p>Recently, knowledge graph-based approaches have gained wider adoption across domains thanks to their ability to enhance explainability and reduce hallucination in domain-specific tasks. Although graph-based architectures have shown promising results, evaluation remains an open issue due to the complexity of the analysis and the inherent subjectivity and variability involved in practical use scenarios and stakeholders' needs. In this context, we present GRADES (Graph-based Reliability Assessment of Domain-specific Evaluation Systems), an evaluation framework for graph-based question answering. To investigate the reliability of current state-of-the-art evaluation strategies, we insert both automatic and qualitative human-based evaluation at each step (information extraction, entity linking and verbalization) of a reference graph-based QA pipeline. At the final step, domain experts are engaged to assess both the correctness and the soundness of the verbalized output. We apply the pipeline and evaluation framework to a case study in the literary domain, showing that the punctual evaluation of each step is able to highlight the limits of off-the-shelf tools in a practical use case.</p>
      </abstract>
      <kwd-group>
        <kwd>Graph-based Question Answering</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Human-in-the-Loop</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across a
wide range of tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Despite these advancements, several scholars have begun to highlight the
limitations of such models across multiple levels. For instance, LLM outputs have been shown to
suffer from issues such as hallucinations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], outdated knowledge [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and a lack of domain-specific
expertise [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Additionally, studies have pointed out that LLMs often reflect societal, cultural biases
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and under-represent marginalized groups [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. These limitations undermine the explainability and
trustworthiness of LLMs, particularly when applied to niche domains [
        <xref ref-type="bibr" rid="ref2">7, 2, 8</xref>
        ]. In this context,
graph-based retrieval-augmented generation (RAG) approaches, and knowledge graph-based methods more
broadly, have shown promising results across various tasks [9, 10, 11]. These approaches have emerged
as effective strategies to mitigate the aforementioned limitations by incorporating external knowledge
while leveraging the capabilities of LLMs [12, 13], demonstrating strong potential in various fields
[14, 15], particularly in areas that require precise and up-to-date information and involve specialized
knowledge such as question answering (QA). Despite the growing interest and the potential of
graph-based QA, however, the lack of a standardized evaluation framework remains an open research challenge
[16, 17].
      </p>
      <p>In this regard, several studies have stressed the importance of integrating traditional quantitative
metrics with human feedback, particularly in narrative domains such as literature or digital humanities,
to enhance explainability and ensure output reliability [7, 18]. Additional challenges emerge when
LLM-generated outputs are evaluated, particularly since scholars have questioned the validity of using
LLMs to assess their own responses [19, 20]. These concerns underscore the need for new proposals in
the context of hybrid evaluation methodologies that combine human and automatic feedback. Finally,
it is important to note that since graph-based QA pipelines are inherently multi-component, relying on
a single evaluation strategy carried out on the final output may be overly restrictive.</p>
      <p>Having in mind the potential of graph-based QA and the challenges that evaluation sets in this
domain (e.g., lack of human involvement, poor reliability on rare entities, lack of multi-step evaluation),
we developed GRADES (Graph-based Reliability Assessment of Domain-specific Evaluation Systems), a
multi-step evaluation framework that combines automatic and human-generated metrics deployed on a
reference graph-based pipeline (code available at https://anonymous.4open.science/r/Talk-5AC5/README.md). Aware of the potential of a collaborative approach [21, 22], we propose
a methodology in which evaluation is performed separately at every step (e.g., information extraction,
entity linking, triple extraction); for each step, an evaluation is provided from both a quantitative and a
qualitative, human-based point of view.</p>
      <p>The goal of this approach is to deliver an evaluation framework able to face the still open challenge
of integrating quantitative metrics about the extracted and linked entities, and a more qualitative
evaluation by domain experts on the soundness and completeness of the answer. Finally, aiming to
assess the effectiveness of the methodology, we present a case study in which it is applied to a QA
pipeline in the literary domain, selected due to its large size and its capability to encompass mainstream
as well as rare entities. This case study highlights how the characteristics of current tools impact the
various stages of the pipeline, significantly hindering the final outcome. At the same time, however, it
indicates some research directions to integrate graphs and LLMs more effectively in a pipeline tailored
to the needs of the reference communities.</p>
      <p>The paper is organized as follows: in Section 2, we review the main works on graph-based question
answering approaches and their evaluation, highlighting the open challenges in the field. Section
3 provides a detailed description of the pipeline and the evaluation framework. The case study is
presented in Section 4. In sections 5 and 6 we respectively present the conclusions and discuss the
limitations of our work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>Leveraging information retrieved from knowledge graphs to reduce factual errors in various LLM-based
tasks has become increasingly common in recent years [23, 13, 24]. Graph-based tasks are typically
implemented as multi-stage pipelines comprising steps such as entity extraction, graph retrieval, triple
verbalization, and prompt tuning, depending on the specific methodology employed. In the context of
QA, several graph-based approaches have been proposed recently [25, 26, 27, 28], each leveraging the
structural advantages of graphs, such as nodes and their relations, in distinct ways.</p>
      <p>In terms of evaluation, Graph-RAG approaches are commonly assessed on question answering
benchmarks using metrics such as F1 score, accuracy, and recall [29, 26] or leveraging lexical similarity
metrics [30] such as ROUGE [31] or BLEU [32]. While several benchmarks for evaluating question
answering tasks are now available [33, 34], the evaluation of graph-based methods remains an open
challenge [35].</p>
      <p>Recent studies [36, 35] have pointed out the limitations of automated evaluation methods, highlighting
their poor correlation with human judgment [37] and their failure to capture factuality or faithfulness
issues in text [38]. Concurrently, other works [39, 40] are beginning to emphasize the potential of
hybrid approaches that combine the strengths of LLMs with human judgment, aiming to balance validity
and reliability while minimizing data requirements. These studies emphasize that dimensions such
as correctness, clarity, and informativeness are not objective and are often interpreted differently by
human evaluators and automated systems. In this context, some recent studies positively adopted a
hybrid approach to verify the output of a RAG pipeline. Yu et al. [41] successfully integrate human
judgment in the evaluation process of their RAG pipeline to ensure the reliability and robustness of
their results. Gienapp and colleagues [42] highlight the limitations of using LLMs to evaluate
LLM-generated responses and, consequently, advocate for the validity of human evaluation. To this end, they
propose a crowdsourcing-based methodology for the evaluation of RAG systems. While collaborative
methodologies, well-established in other fields such as human-in-the-loop [43, 44], appear promising,
they remain underexplored and relatively novel within the graph-based QA domain.</p>
      <p>Moreover, when assessing the generated answers, multiple dimensions must be considered, such as
correctness, coherence, completeness, and alignment with the actual needs of stakeholders. Recent
studies have proposed LLM-based evaluation frameworks. For example, Es et al. [45] introduce RAGAs,
a framework for evaluating Retrieval-Augmented Generation pipelines without reference data. Their
methodology assesses Faithfulness, Answer Relevance, and Context Relevance through a multi-step
LLM-based evaluation that takes the question and generated answer as input. While the framework is
promising, the authors acknowledge its limitations, as it relies heavily on the performance of the LLMs
used for evaluation. In addition, since LLMs are known to struggle with handling rare entities [46, 47],
basing the evaluation entirely on such models may yield less reliable results.</p>
      <p>In our work, acknowledging the current challenges in evaluation, we aim, on one hand, to assess the
reliability of state-of-the-art tools for the different steps of the pipeline, and on the other, to propose a
framework that integrates automatic metrics with human judgment.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Pipeline and Evaluation Design</title>
      <p>To test our comprehensive evaluation framework, we developed a graph-based QA pipeline leveraging
open-source, state-of-the-art tools. The pipeline comprises several components, including information
extraction, entity linking, triple extraction, and triple verbalization. Given a user query, the pipeline
first extracts target entities using an LLM-based approach, and then links these entities to those present
in knowledge bases. Subsequently, relevant nodes are retrieved from a knowledge graph using SPARQL
queries. A pruned subgraph is constructed, and the most relevant triples are verbalized through
fine-tuned LLMs. Figure 1 illustrates the pipeline workflow from user query to system-generated answer
and the evaluation steps provided for each phase.</p>
      <sec id="sec-3-1">
        <title>3.1. Information Extraction</title>
        <p>The first step of the pipeline is an Information Extraction (IE) task, aimed at extracting relevant entities
and entity types from a question formulated in natural language. However, unlike a standard
Named Entity Recognition task, this step also includes generic mentions of entity types to drive the
search in the KG. For instance, consider the sentence “Which are the Italian women who won the
Nobel Prize?”, where it is possible to identify two named entities, a demonym (Italian) and a prize
(Nobel) respectively, as well as the generic type women, which is relevant for retrieval.
</p>
        <p>Baseline. We use as a baseline for this step NuExtract-1.5 (https://huggingface.co/numind/NuExtract-1.5) [48], an open-source, lightweight,
text-to-JSON fine-tuned large language model designed to extract complex information from text and
organize it into structured data. We adopted the small version of NuExtract (NuExtract-tiny), as it
demonstrates remarkable performance even in zero-shot settings when compared to state-of-the-art
models, despite its significantly smaller size. Furthermore, prioritizing an open-source solution was
essential to ensure the reproducibility of our results and resources. The task consists of completing a
predefined template containing empty entity slots with their classes based on the input question. The
template was defined by selecting the top-level classes of the leveraged knowledge base:
Input: "entities": { "work": [], "person": [], "subject": [], "publisher": [] }</p>
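        <p>To make this step concrete, the following is a minimal sketch of template-based extraction with a text-to-JSON model via the Hugging Face transformers library. The prompt layout follows the general convention documented for NuExtract, but the exact markers, decoding settings, and the example question are illustrative assumptions rather than the configuration used in our experiments.</p>
        <p>import json
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint id; NuExtract-tiny can be substituted for a smaller footprint.
model_id = "numind/NuExtract-1.5"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

template = {"work": [], "person": [], "subject": [], "publisher": []}
question = "Which are the Italian women who won the Nobel Prize?"

# Prompt format with input/output markers as described on the model card (illustrative).
prompt = (
    "&lt;|input|&gt;\n### Template:\n" + json.dumps(template, indent=4)
    + "\n### Text:\n" + question + "\n\n&lt;|output|&gt;\n"
)
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Keep only the newly generated tokens: the filled JSON template.
filled = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(filled)</p>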
        <p>Evaluation Metrics. In this phase, the evaluation consists of calculating the precision, recall, and
F1 score of the information extracted by the baseline model against a manually annotated gold standard
labeled by domain experts. The manual creation of the gold standard by two researchers enables precise
assessment of the baseline model’s performance and supports qualitative analysis.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Entity Linking</title>
        <p>The second step of the pipeline is Entity Linking (EL), aimed at linking the extracted knowledge to the
corresponding entities in the KG. Since our aim is to evaluate the overall performance of a system in
interacting with structured knowledge, the EL phase takes as input the knowledge extracted in the
previous phase. Models are fed with entities and their classes and must return the top-k candidates in
the KG and all the triples where they appear as subject or object.</p>
        <p>Baseline. Our baseline for this step is a vector-based similarity search carried out by adopting two
distinct embedding models: sentence-transformers/all-MiniLM-L6-v2 [49] and
GIST-Embedding-v0 [50]. Each input entity, previously extracted from a text and associated with its semantic class (e.g.,
Work, Person), is encoded into a vector representation using both embedding models. These vectors
are then used to query a pre-built FAISS (https://github.com/facebookresearch/faiss) index corresponding to the entity type. Each index stores
vectorized representations of known entities, enabling efficient nearest neighbor search. Given a query
entity, the index returns the top-k most similar candidates based on vector similarity. The retrieved
entities are identified via their FAISS index positions and then resolved to their corresponding names
and labels using a lookup table stored in Parquet format (https://parquet.apache.org/), an open-source data file format that enables
efficient data storage and retrieval. All linked entities derived from the output template are aggregated
into a unified list representing the complete set of candidate links for the input entities. Finally, the
top-ranked entities are used to retrieve triples from the knowledge graph. This is accomplished via
a SPARQL query that extracts all triples from the knowledge base where the target entity appears as
object or subject.</p>
        <p>SELECT DISTINCT ?label ?p WHERE {{
  BIND ({entity} AS ?ent) .
  {{ ?s ?p ?ent }} UNION {{ ?ent ?p ?s }} .
  ?s rdfs:label ?label .
}}</p>
        <p>
Evaluation. This evaluation phase involves manual validation by domain experts. Specifically, the
top ten entities linked to each input question are assessed for relevance. Average precision is then
computed to quantify the proportion of relevant entities, based on how many of the ten retrieved
entities are pertinent. This evaluation serves a dual purpose: it enables a comparative performance
analysis of the two embedding models used in the entity linking process and provides insights into the
types of entities that are most challenging to link.</p>
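        <p>As an illustration of the vector-based linking described above, the following is a minimal sketch using sentence-transformers and FAISS. The entity labels, the lookup table, and the query mention are illustrative stand-ins for the knowledge-graph data used in the pipeline.</p>
        <p>import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Known KG entities of one type (e.g., "subject"); in the pipeline these come from the graph.
labels = ["alps", "pyrenees", "banat", "morocco"]           # illustrative
lookup = pd.DataFrame({"label": labels})                    # stand-in for the Parquet lookup table

# Build a flat L2 index over the label embeddings.
emb = model.encode(labels, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexFlatL2(emb.shape[1])
index.add(emb)

# Link an extracted mention by nearest-neighbour search (top-k candidates).
query = model.encode(["the Alps"], normalize_embeddings=True).astype(np.float32)
distances, ids = index.search(query, 3)
candidates = lookup.iloc[ids[0]].assign(distance=distances[0])
print(candidates)   # ranked candidate entities with their vector distances</p>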
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Triples verbalization</title>
        <p>Finally, verbalization of the extracted triples is performed, aimed at converting the RDF data into natural
language sentences. This step is crucial to assess to what extent a model is able to recognize the
semantics of triples in their verbalization.</p>
        <p>Baseline. For the triples verbalization, inspired by Oliverio and colleagues [51], three different mid-sized
open-weight English LLMs were fine-tuned: LLaMA 3.1 8B Instruct, Qwen 2.5 7B Instruct [52],
and Mistral-Nemo-Instruct-2407. The fine-tuning phase was performed using the WebNLG corpus
[53], a linguistic resource consisting of data units, each represented as a set of RDF triples (subject,
predicate, object) extracted from 15 distinct DBpedia categories. Each data unit is accompanied by one
or more human-written verbalizations produced by expert annotators. For the experiments, we adopted
WebNLG 3.0, which was released during the WebNLG 2020 Challenge. The dataset was split into
train, dev, and test sets, with each data unit containing between 1 and 7 RDF triples. The parameters
used to fine-tune the models are shown in Table 1. After fine-tuning, we used the models to generate
verbalizations for all the extracted RDF triples.</p>
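        <p>The following is a minimal sketch of a QLoRA-style fine-tuning setup for the verbalization models, using the Hugging Face transformers, peft, and datasets libraries. The checkpoint id, hyperparameter values, data file, and prompt layout are illustrative assumptions; the actual values used in our experiments are those reported in Table 1.</p>
        <p>import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # assumed checkpoint id

# 4-bit quantization (bitsandbytes) so the base model fits on a single GPU.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16,
                         bnb_4bit_use_double_quant=True)
tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                                         task_type="CAUSAL_LM"))

# WebNLG-style examples: serialized triples as input, a reference verbalization as target.
data = load_dataset("json", data_files="webnlg_train.json")["train"]   # assumed local file
def tokenize(ex):
    text = f"Triples: {ex['triples']}\nText: {ex['text']}{tok.eos_token}"
    return tok(text, truncation=True, max_length=512)
data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="verbalizer-lora", per_device_train_batch_size=4,
                           gradient_accumulation_steps=4, num_train_epochs=3,
                           learning_rate=2e-4, bf16=True, logging_steps=50),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()</p>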
        <p>Evaluation. To evaluate the generated outputs, we structured a two-fold strategy. On one hand,
aiming to assess the completeness and the overall correctness of the produced sentences, we followed
the taxonomy proposed by Kasner and Dušek [54] and manually assessed 50 randomly sampled triple
verbalizations per model. The taxonomy classifies errors into four categories: Incorrect, where the text
contradicts the data; Not Checkable, where the information cannot be verified; Misleading, where the
text is deceptive given the context or the information is missing; and Other, for problematic cases not
fitting the other categories.</p>
        <p>On the other hand, to assess the relevance of the generated answers and their validity from a
domain-specific perspective, a separate annotation task was conducted. Subject matter experts were asked to
judge the triples, using a structured evaluation template to assess whether the output was pertinent to the
input question.</p>
        <p>Table 1 summarizes the fine-tuning configuration: the QLoRA parameters (LoRA attention dimension, alpha parameter, dropout probability), the bitsandbytes parameters (4-bit precision, compute dtype for 4-bit, quantization type, nested quantization), and the TrainingArguments parameters (number of training epochs, fp16/bf16 training, per-GPU batch sizes for training and evaluation, gradient accumulation steps, maximum gradient norm, initial learning rate, weight decay, optimizer, learning rate schedule, warmup ratio).</p>
        <p>This phase of the evaluation goes beyond traditional quantitative metrics by determining
whether the extracted, linked, and verbalized information is not only technically correct but also useful
in a real-world scenario, providing insight into the applicability of the system’s outputs.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Case Study: QA for a Specialized Literary Domain</title>
      <p>To assess the effectiveness of the proposed pipeline, we present a case study in which the evaluation
framework is applied in practice. The leveraged knowledge base is the World Literature Knowledge
Graph [55], a collection of writers and works (10.8 million entities of type human) derived from Wikidata,
Goodreads, and Open Library, created to detect and counteract cultural underrepresentation through the
integration of minor literary archives from selected research projects. Given the well-known limitations
of LLMs in handling entities from niche or specialized domains, this case study focuses on the literary
domain, and on less-represented, minority literary traditions. The rationale behind this choice is
to evaluate the pipeline in a scenario where domain-specific knowledge is critical but potentially
underrepresented in general-purpose language models. The case study consists of an initiative aimed
to foster interdisciplinary research on cultural heritage through a networked representation of writers,
literary works, and places. To ensure that the case study emphasizes underrepresented entities while
maintaining relevance for domain experts, the evaluation focused on a set of specific domains from
ongoing research projects (Travel and literature in the French-speaking World, Pyrenees in sounds and
pictures, Transylvania and the Banat in British travel writing), and was developed in collaboration with
the scholars involved in these projects.</p>
      <sec id="sec-4-1">
        <title>4.1. Case study: Input Question</title>
        <p>Recognizing the value of collaborative and participatory approaches in research design [21], we engaged
domain experts and collaborators during the initial phase of input design. First, a research team proposed
a set of questions in natural language; these questions were subsequently reviewed and validated by
two scholars from the foreign literature department.</p>
        <p>This process resulted in a final benchmark of 15 different questions to evaluate whether more intricate
queries pose greater challenges for the pipeline. Specifically, four questions focused on identifying
narratives set in specific geographic regions (e.g., “Find travel books that are set in the Alps”), five
combined author’s origin and book setting (e.g., “Find books written by French authors and located in
Morocco”), four combined the language of the text and the location it is about (e.g., “Find books written
in French that talk about the Pyrenees”), and two posed highly specific questions related to books about
locations in Romania (e.g., “Find books about Casa Mureșenilor in Brașov”).</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Case study: Information Extraction</title>
        <p>For each input question, the model was provided with the following template reflecting the semantics
of the literary domain:
{"work": [], "subject": [], "publisher": [], "person": [], "language": [], "date": [], "place": []}</p>
        <p>The task consists of extracting relevant entities from the input sentence and populating the
corresponding fields of the predefined template. For the evaluation a gold standard dataset was manually
annotated by two expert annotators using the same set of input questions provided to the models (see
section 4.1). Annotators were asked to fill the entity templates based solely on the question. Finally,
evaluation metrics were computed by comparing the output of NuExtract against the annotated gold
standard. Precision and recall were first calculated for each input question, and then averaged to obtain
the overall metrics. We report a macro-averaged precision of 0.47, a macro-averaged recall of 0.69, and
a macro-averaged F1 score of 0.56. (Results are presented in Table 2.). While the model demonstrates
the ability to extract the relevant entities (Recall) from the input sentence, it often struggles to correctly
assign them to the appropriate category (Precision).</p>
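        <p>For reproducibility, the following is a minimal sketch of how the macro-averaged metrics can be computed per question against the gold standard. The (class, mention) pairs shown are illustrative, not the actual annotations.</p>
        <p>def prf(pred: set, gold: set) -> tuple[float, float, float]:
    # Precision, recall, and F1 for one question, over (class, mention) pairs.
    tp = len(pred.intersection(gold))
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

gold = [  # one entry per question: gold (class, mention) pairs (illustrative)
    {("place", "alps"), ("work", "travel books")},
    {("person", "french authors"), ("place", "morocco")},
]
pred = [  # model output, flattened from the filled template (illustrative)
    {("place", "alps"), ("subject", "travel books")},
    {("person", "french authors"), ("place", "morocco"), ("publisher", "morocco")},
]

scores = [prf(p, g) for p, g in zip(pred, gold)]
macro_p = sum(s[0] for s in scores) / len(scores)
macro_r = sum(s[1] for s in scores) / len(scores)
macro_f1 = sum(s[2] for s in scores) / len(scores)
print(f"macro P={macro_p:.2f} R={macro_r:.2f} F1={macro_f1:.2f}")</p>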
        <p>From a qualitative perspective, it was observed that the entity extraction task appears to be more
challenging for simpler questions involving fewer entities. This may be attributed to the model’s
tendency to fill all available slots in the given template, even when fewer relevant entities are present.
In such cases, recall tends to be high, since most relevant entities are retrieved, but precision is relatively
low, as the extracted entities are often incorrectly associated with the target classes.</p>
        <p>These findings indicate that this step requires model capabilities beyond those addressed by standard
information extraction (IE) tasks. For example, extracting information from a query like “Find all books
that talk about Marseille” goes beyond traditional Named Entity Recognition (NER), as “book” represents
a generic mention rather than a named entity. Moreover, the models demonstrate difficulty in handling
rare or less frequent entity types. While they correctly associate “Marseille” with location-related types,
they often fail to do so with entity types such as “Publisher”.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Case Study: Entity Linking</title>
        <p>Following the proposed methodology (Section 3.2), the extracted entities were vectorized, and the k most similar
entities were retrieved using a pre-built FAISS index corresponding to each entity type in the knowledge
graph. The evaluation was conducted on the top 10 retrieved entities, with two annotators assessing
the relevance of each result. Annotators were asked to indicate how many out of the ten entities were
relevant to the input question. They were presented with a list of input questions (e.g. Find travel books
that are set in the Alps) each associated with a ranked list of linked entities retrieved by the model
(e.g. {‘entity’: ‘urb:urb_subject_6978’, ‘label’: ‘alps’, ‘distance’: 4.0611733268837436e-13}), and they were
asked to decide how many of the top 10 retrieved entities are relevant to the input question.</p>
        <p>Table 2. Information Extraction: Macro Precision 0.47, Macro Recall 0.69, Macro F1 0.56. Entity Linking: SBERT Avg Precision 0.326, GIST Avg Precision 0.32.</p>
        <p>The results of the human evaluation were obtained by calculating the Average Precision at 10 over the
entire set of input questions. The two models, SBERT and GIST, demonstrated comparable performance,
achieving average precision scores of 0.326 and 0.32, respectively (see results in Table 2). Both models
tended to perform better on the inputs that showed higher precision and recall in the entity extraction
task, suggesting that entity linking performance partially depends on the quality of the extracted entities.</p>
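        <p>A minimal sketch of the Average Precision at 10 computation over the question set is given below; the relevance counts are illustrative, not the annotators' actual judgments.</p>
        <p># For each input question: how many of the top-10 linked entities were judged relevant.
relevant_at_10 = [4, 2, 5, 0, 3]                     # illustrative counts, one per question
precision_at_10 = [r / 10 for r in relevant_at_10]   # Precision@10 per question
avg_precision = sum(precision_at_10) / len(precision_at_10)
print(f"Average Precision@10 = {avg_precision:.3f}")</p>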
        <p>The findings highlight that vector similarity correlates closely with surface-level textual similarity,
frequently resulting in the linking of entities with similar names or spellings. Furthermore, our results
highlight a persistent gap in the field of entity linking: current state-of-the-art models, such as Relik [56],
while demonstrating strong overall performance, are typically trained on a limited set of taxonomies
(e.g., Wikidata). As a result, they are not easily transferable to other knowledge graphs and exhibit
limitations in handling long-tail entities [57].</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Case study: Triple Verbalization</title>
        <p>The ten top-ranked entities per question are then used to extract triples from the KG. After fine-tuning,
the models are used to generate verbalizations for all RDF triples corresponding to the 15 target questions.
This process resulted in 358 verbalizations. For each question, the corresponding verbalizations were
concatenated. Two of the authors manually evaluated the generations and achieved an agreement of
0.98 for Krippendorff's alpha and 0.96 for Cohen's kappa, indicating an almost perfect agreement. Table
3 shows the results of this manual evaluation.</p>
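        <p>A minimal sketch of the agreement computation is shown below. The labels are illustrative; Cohen's kappa comes from scikit-learn, and Krippendorff's alpha assumes the third-party krippendorff package is installed.</p>
        <p>from sklearn.metrics import cohen_kappa_score
import krippendorff   # pip install krippendorff (assumed available)

# Illustrative error-category labels assigned by the two annotators.
annotator_1 = ["Incorrect", "OK", "OK", "Misleading", "OK", "Not Checkable"]
annotator_2 = ["Incorrect", "OK", "OK", "Misleading", "Other", "Not Checkable"]

kappa = cohen_kappa_score(annotator_1, annotator_2)

# Krippendorff's alpha expects numeric category codes (one row per annotator).
codes = {c: i for i, c in enumerate(sorted(set(annotator_1 + annotator_2)))}
data = [[codes[c] for c in annotator_1], [codes[c] for c in annotator_2]]
alpha = krippendorff.alpha(reliability_data=data, level_of_measurement="nominal")

print(f"Cohen's kappa = {kappa:.2f}, Krippendorff's alpha = {alpha:.2f}")</p>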
        <p>All three models exhibit a preference for passive constructions even when the original predicate
is active. For instance, the triple Alps, has topic, Melissa Hill is verbalized by Llama 3.1 8B as
“Melissa Hill is a topic related to the Alps”. Among the most recurrent error types are those concerning
named entities, which are often modified, e.g., through alterations of book titles or geographic names.
Notably, the most frequent and systematic error across all three models involves triples with the
predicate publishing language, which are regularly erroneously verbalized due to confusion between
subject and object or through incorrect paraphrasing. For example, the triple French, publishing
language, Rosa Montero is incorrectly verbalized by Mistral-Nemo-Instruct-2407 as “Rosa Montero
is a French publisher.”. This behavior may be attributed to the fact that, during training, the models were
never exposed to verbalizations involving this predicate. Importantly, the models almost never exhibit
hallucination or omission phenomena. These results suggest that, unexpectedly, the employed models
lack awareness of entity types. For example, the triple Alps has topic Messner was verbalized as
“Reinhold Messner is a topic of the Alps”, indicating that the model failed to recognize that Messner is a
person (i.e., an agent) and the Alps is a location, more plausibly the subject of a novel or an author’s
interest.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Case study: Domain Expert Validation</title>
        <p>To validate the extracted triples and their corresponding verbalizations, we involved two researchers
from the Department of Foreign Literature. Their role was to evaluate the reliability and relevance of
the extracted content. For each instance, the annotators were provided with the input question, the raw
triples, and their verbalized versions (some examples are presented in Table 5). They were then asked to
answer two evaluation questions: In the context of the input question, are the raw data or the verbalized
triple more explanatory? (task_1), Is the extracted knowledge pertinent to the input question? (task_2).</p>
        <p>The opinions of the interviewed experts were not unanimous, and they encountered some difficulties
in carrying out the evaluation. The assessment results were poorer than expected, as the majority of
outputs were labeled as non-pertinent. This outcome is likely attributable to the limited performance
observed in the initial steps of the pipeline.</p>
        <p>To assess the dificulty of the annotation tasks and the consistency between annotators, we calculated
the inter-annotator agreement (IAA) for both tasks (Table 4). We obtain a Cohen's kappa of 0.315 for Task_1
and 0.154 for Task_2, indicating slight agreement. To further examine the outputs of the verbalization
model in Task_1, a third expert was consulted to resolve cases of disagreement and provide the final
judgment.</p>
        <p>Out of 48 analyzed examples, in 29 cases the annotators judged the verbalized form as more informative
than the raw triple representation. However, the analysis revealed that the verbalization model exhibited
limited semantic understanding, particularly in interpreting the meaning of relations within the triples.
For instance, the input triple books has topic This is Not the End of the Book, which refers
to the work by Umberto Eco and Jean-Claude Carrière, was verbalized as “This is Not the End of the
Book is a topic in books”. In contrast, a more common input such as books has topic Mark Twain's
Burlesque Autobiography was correctly verbalized as “Mark Twain’s Burlesque Autobiography is a
book.” This suggests that the model performs better on more straightforward or lexically familiar inputs.
Additionally, it appears that verbalization models face greater challenges when dealing with rare or
underrepresented entities. To assess the diffusion and representation of these entities, we examined
their presence in Wikidata. We found that the entity “Mark Twain’s (Burlesque) Autobiography and
First Romance” (wd:Q6770070) is associated with 47 triples, whereas “This is Not the End of the Book”
is not present in Wikidata at all. This disparity may indicate that LLMs struggle with handling rare or
low-resource entities across a variety of tasks.
</p>
        <p>Table 5. Examples of input questions, extracted triples, and their verbalizations.
Question: Find travel books that are set in the Alps. Triples: alps has topic Mont Blanc aux sept vallées; alps has topic Melissa Hill. Verbalization: “Melissa Hill is a topic related to the Alps.”
Question: Find books written by French authors and located in Morocco. Triples: French publishing language Rosa Montero; morocco has topic Abdallah Laroui. Verbalizations: “Rosa Montero writes in French.”; “Abdallah Laroui is a topic of Morocco.”
Question: Find Romanian books that talk about Banat. Triples: Romanian publishing language Panait Istrati; Romanian publishing language Miodrag Bulatović. Verbalizations: “Panait Istrati is a Romanian author.”; “Miodrag Bulatović is a Romanian publisher.”</p>
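        <p>The Wikidata coverage check mentioned above can be reproduced with a short query against the public SPARQL endpoint; a minimal sketch follows, in which the entity ID matches the example discussed in the text while the user-agent string and error handling are illustrative.</p>
        <p>import requests

ENDPOINT = "https://query.wikidata.org/sparql"
# Count the triples in which the entity appears as subject (rough proxy for its coverage).
query = "SELECT (COUNT(*) AS ?n) WHERE { wd:Q6770070 ?p ?o . }"

resp = requests.get(
    ENDPOINT,
    params={"query": query, "format": "json"},
    headers={"User-Agent": "grades-coverage-check/0.1 (research prototype)"},  # illustrative
    timeout=30,
)
resp.raise_for_status()
count = int(resp.json()["results"]["bindings"][0]["n"]["value"])
print(f"wd:Q6770070 appears as subject in {count} triples")</p>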
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Works</title>
      <p>In this work, we presented a human-centered framework for the evaluation of graph-based question
answering. We developed a graph-based QA system composed of discrete, interpretable steps, and
conducted both quantitative and qualitative evaluations for each phase. Recognizing the value of
collaborative approaches, we involved domain experts in the initial phase to validate the soundness of
the input questions, as well as in the overall evaluation phase to assess the relevance and quality of
the system’s outputs. Our focus was particularly centered on the evaluation phase, as the literature
indicates that determining how, and through which parameters, to exhaustively evaluate such systems
remains an open challenge. We first observed that tools based on LLMs often struggle with handling
structured knowledge, and that without human supervision, such as evaluation through gold standards
or manual checks, they do not appear to be suitable for exhaustively completing knowledge
graph-related tasks. The evaluation of the system’s output conducted by domain experts provides additional
insights on how LLMs struggle to handle rare and underrepresented entities. This issue, which is of
high relevance within the AI community, becomes even more critical in the context of graph-based
frameworks. Since many pipelines that incorporate external knowledge bases are specifically designed
to mitigate the limitations of LLMs, relying solely on LLMs to evaluate such systems is at risk of creating
a self-reinforcing loop. As recently argued by several scholars [20, 19], substituting human judgment
with LLM-based assessments may be an overly simplistic and potentially misleading approach.</p>
      <p>Our work highlights the importance of integrating intermediate evaluation steps into the pipeline
and demonstrates that human evaluation remains a crucial component, particularly when assessing the
final output. We argue that the outputs generated by such models must be validated by domain experts,
as they are ultimately responsible for determining the utility and relevance of the information provided.
In addition, our findings highlight the limitations of current models when used for handling semantic
data, emphasizing its inherent complexity. Through qualitative evaluation, we provide updated insights
into ongoing challenges and offer perspectives on how these models could be improved.</p>
      <p>This work, while proposing initial metrics and reflections on the role of evaluation, paves the way for
several future works. First, we plan to develop a large-scale evaluation framework involving scholars
and experts from diverse disciplines to ensure broader and more robust validation. Second, in the
context of verbalization, we intend to enhance the model with additional semantic information, such as
the top-level classes of the involved entities, and assess the impact of this information on the quality
and accuracy of the generated verbalizations. In conclusion, by proposing an analysis and evaluation
framework that extends beyond traditional metrics, this study provides a deeper understanding of
the current challenges associated with the various phases of graph-based question answering and its
evaluation.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Limitations</title>
      <p>The present study offers an in-context evaluation of tools introduced at various stages of a graph-based
question answering pipeline. While the limited number of tools evaluated could be seen as a limitation,
we intentionally selected state-of-the-art models to provide up-to-date metrics and linked qualitative
considerations. Additionally, the use of a single case study could be not exhaustive, however, we chose
this niche area because it aligns with ongoing domain-specific projects that are of interest to various
scholars collaborating on larger initiatives.</p>
      <p>In terms of generalizability, the objective of this study was to develop an infrastructure for evaluating
different models at various stages, and across diverse knowledge graphs. In a research landscape
where assessing the reasoning abilities of LLMs remains an open challenge, we argue that providing a
flexible and domain-independent framework is a critical step toward understanding how these models
perform on specific tasks, topics, and rare entities. While a limitation of this work is its focus on a single
case study, the insights gained from both our qualitative and quantitative analyses inform not only
future enhancements to the evaluation pipeline, but also broader discussions on the socio-technical
implications of deploying such systems in domain-specific contexts.</p>
      <p>Finally, the involvement of a small number of field experts means that the results may not be fully
representative. As part of future work, we plan to conduct a large-scale analysis involving more experts
and assess a broader range of input questions.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[7] V. Armant, A. Mouakher, F. Vargas-Rojas, D. Symeonidou, J. Guérin, I. Mougenot, J.-C. Desconnets,
Can knowledge graphs and retrieval-augmented generation be combined to explain query/answer
relationships truthfully?, in: DAO-XAI 2024 Data meets Ontologies in Explainable AI co-located
with the 27th European Conference on Artificial Intelligence (ECAI 2024), volume 3833, 2024.
[8] R. Jia, B. Zhang, S. J. R. Méndez, P. G. Omran, Leveraging large language models for semantic
query processing in a scholarly knowledge graph, arXiv preprint arXiv:2405.15374 (2024).
[9] D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, D. Metropolitansky, R. O. Ness,
J. Larson, From local to global: A graph rag approach to query-focused summarization, arXiv
preprint arXiv:2404.16130 (2024).
[10] Z. Xu, M. J. Cruz, M. Guevara, T. Wang, M. Deshpande, X. Wang, Z. Li, Retrieval-augmented
generation with knowledge graphs for customer service question answering, in: Proceedings
of the 47th International ACM SIGIR Conference on Research and Development in Information
Retrieval, 2024, pp. 2905–2909.
[11] W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, Q. Li, A survey on rag meeting llms:
Towards retrieval-augmented large language models, in: Proceedings of the 30th ACM SIGKDD
conference on knowledge discovery and data mining, 2024, pp. 6491–6501.
[12] T. Vu, M. Iyyer, X. Wang, N. Constant, J. Wei, J. Wei, C. Tar, Y.-H. Sung, D. Zhou, Q. Le, T. Luong,
FreshLLMs: Refreshing large language models with search engine augmentation, in: L.-W. Ku,
A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL
2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 13697–13720. URL:
https://aclanthology.org/2024.findings-acl.813/. doi:10.18653/v1/2024.findings-acl.813.
[13] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, H. Wang, H. Wang, Retrieval-augmented
generation for large language models: A survey, arXiv preprint arXiv:2312.10997 2 (2023).
[14] S. Aghaei, E. Raad, A. Fensel, Question answering over knowledge graphs: A case study in tourism,
IEEE Access 10 (2022) 69788–69801.
[15] A. Tauqeer, I. Hammid, S. Aghaei, P. Parvin, E. M. Postma, A. Fensel, Smell and taste disorders
knowledge graph: answering questions using health data, Expert Systems with Applications 234
(2023) 121049.
[16] D. Galla, S. Hoda, M. Zhang, W. Quan, T. D. Yang, J. Voyles, Courage: A framework to evaluate
rag systems, in: International Conference on Applications of Natural Language to Information
Systems, Springer, 2024, pp. 392–407.
[17] S. Simon, A. Mailach, J. Dorn, N. Siegmund, A methodology for evaluating rag systems: A case
study on configuration dependency validation, arXiv preprint arXiv:2410.08801 (2024).
[18] E. Kamalloo, A. Jafari, X. Zhang, N. Thakur, J. Lin, Hagrid: A human-llm collaborative dataset for
generative information-seeking with attribution, arXiv preprint arXiv:2307.16883 (2023).
[19] C. L. Clarke, L. Dietz, Llm-based relevance assessment still can’t replace human relevance
assessment, arXiv preprint arXiv:2412.17156 (2024).
[20] I. Soboroff, Don’t use llms to make relevance judgments, Information retrieval research journal 1
(2025) 10–54195.
[21] F. Delgado, S. Yang, M. Madaio, Q. Yang, The participatory turn in ai design: Theoretical
foundations and the current state of practice, in: Proceedings of the 3rd ACM Conference on Equity and
Access in Algorithms, Mechanisms, and Optimization, 2023, pp. 1–23.
[22] R. Sapkota, S. Raza, M. Karkee, Comprehensive analysis of transparency and accessibility of
chatgpt, deepseek, and other sota large language models, arXiv preprint arXiv:2502.18505 (2025).
[23] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances
in neural information processing systems 33 (2020) 9459–9474.
[24] Y. Tang, Y. Yang, Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop
queries, arXiv preprint arXiv:2401.15391 (2024).
[25] D. Taunk, L. Khanna, S. V. P. K. Kandru, V. Varma, C. Sharma, M. Tapaswi, Grapeqa: Graph
augmentation and pruning to enhance question-answering, in: Companion Proceedings of the
ACM Web Conference 2023, 2023, pp. 1138–1144.
[26] J. Zhang, X. Zhang, J. Yu, J. Tang, J. Tang, C. Li, H. Chen, Subgraph retrieval enhanced model for
multi-hop knowledge base question answering, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.),
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 5773–5784.
URL: https://aclanthology.org/2022.acl-long.396/. doi:10.18653/v1/2022.acl-long.396.
[27] C. Mavromatis, G. Karypis, Gnn-rag: Graph neural retrieval for large language model reasoning,
arXiv preprint arXiv:2405.20139 (2024).
[28] J. Kim, Y. Kwon, Y. Jo, E. Choi, KG-GPT: A general framework for reasoning on knowledge graphs
using large language models, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for
Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore,
2023, pp. 9410–9421. URL: https://aclanthology.org/2023.findings-emnlp.631/. doi:10.18653/v1/
2023.findings-emnlp.631.
[29] J. Sun, C. Xu, L. Tang, S. Wang, C. Lin, Y. Gong, L. M. Ni, H.-Y. Shum, J. Guo, Think-on-graph:
Deep and responsible reasoning of large language model on knowledge graph, arXiv preprint
arXiv:2307.07697 (2023).
[30] P. Schmidtova, S. Mahamood, S. Balloccu, O. Dusek, A. Gatt, D. Gkatzia, D. M. Howcroft, O. Platek,
A. Sivaprasad, Automatic metrics in natural language generation: A survey of current evaluation
practices, in: S. Mahamood, N. L. Minh, D. Ippolito (Eds.), Proceedings of the 17th International
Natural Language Generation Conference, Association for Computational Linguistics, Tokyo,
Japan, 2024, pp. 557–583. URL: https://aclanthology.org/2024.inlg-main.44/. doi:10.18653/v1/
2024.inlg-main.44.
[31] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization
Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL:
https://aclanthology.org/W04-1013/.
[32] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine
translation, in: Proceedings of the 40th annual meeting of the Association for Computational
Linguistics, 2002, pp. 311–318.
[33] J. Chen, H. Lin, X. Han, L. Sun, Benchmarking large language models in retrieval-augmented
generation, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 2024,
pp. 17754–17762.
[34] R. Friel, M. Belyi, A. Sanyal, Ragbench: Explainable benchmark for retrieval-augmented generation
systems, arXiv preprint arXiv:2407.11005 (2024).
[35] H. Yu, A. Gan, K. Zhang, S. Tong, Q. Liu, Z. Liu, Evaluation of retrieval-augmented generation: A
survey, in: CCF Conference on Big Data, Springer, 2024, pp. 102–120.
[36] B. Peng, Y. Zhu, Y. Liu, X. Bo, H. Shi, C. Hong, Y. Zhang, S. Tang, Graph retrieval-augmented
generation: A survey, arXiv preprint arXiv:2408.08921 (2024).
[37] J. Novikova, O. Dušek, A. Cercas Curry, V. Rieser, Why we need new evaluation metrics for NLG, in:
M. Palmer, R. Hwa, S. Riedel (Eds.), Proceedings of the 2017 Conference on Empirical Methods in
Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark,
2017, pp. 2241–2252. URL: https://aclanthology.org/D17-1238/. doi:10.18653/v1/D17-1238.
[38] S. Gehrmann, E. Clark, T. Sellam, Repairing the cracked foundation: A survey of obstacles in
evaluation practices for generated text, Journal of Artificial Intelligence Research 77 (2023) 103–166.
[39] S. Tsaneva, D. Dessì, F. Osborne, M. Sabou, Knowledge graph validation by integrating llms and
human-in-the-loop, Information Processing &amp; Management 62 (2025) 104145.
[40] G. Faggioli, L. Dietz, C. L. Clarke, G. Demartini, M. Hagen, C. Hauff, N. Kando, E. Kanoulas,
M. Potthast, B. Stein, et al., Perspectives on large language models for relevance judgment, in:
Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval,
2023, pp. 39–50.
[41] H. Q. Yu, F. McQuade, Rag-kg-il: A multi-agent hybrid framework for reducing hallucinations
and enhancing llm reasoning through rag and incremental knowledge graph learning integration,
arXiv preprint arXiv:2503.13514 (2025).
[42] L. Gienapp, T. Hagen, M. Fröbe, M. Hagen, B. Stein, M. Potthast, H. Scells, The viability of
crowdsourcing for rag evaluation, arXiv preprint arXiv:2504.15689 (2025).
[43] E. Mosqueira-Rey, E. Hernández-Pereira, D. Alonso-Ríos, J. Bobes-Bascarán, Á. Fernández-Leal,
Human-in-the-loop machine learning: a state of the art, Artificial Intelligence Review 56 (2023)
3005–3054.
[44] X. Wu, L. Xiao, Y. Sun, J. Zhang, T. Ma, L. He, A survey of human-in-the-loop for machine learning,
Future Generation Computer Systems 135 (2022) 364–381.
[45] S. Es, J. James, L. Espinosa Anke, S. Schockaert, RAGAs: Automated evaluation of retrieval
augmented generation, in: N. Aletras, O. De Clercq (Eds.), Proceedings of the 18th Conference of
the European Chapter of the Association for Computational Linguistics: System Demonstrations,
Association for Computational Linguistics, St. Julians, Malta, 2024, pp. 150–158. URL: https:
//aclanthology.org/2024.eacl-demo.16/. doi:10.18653/v1/2024.eacl-demo.16.
[46] C. Holtermann, P. Röttger, T. Dill, A. Lauscher, Evaluating the elementary multilingual capabilities
of large language models with MultiQ, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the
Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics,
Bangkok, Thailand, 2024, pp. 4476–4494. URL: https://aclanthology.org/2024.findings-acl.265/.
doi:10.18653/v1/2024.findings-acl.265.
[47] H. Li, Y. Ning, Z. Liao, S. Wang, X. L. Li, X. Lu, W. Zhao, F. Brahman, Y. Choi, X. Ren, In search
of the long-tail: Systematic generation of long-tail inferential knowledge via logical rule guided
search, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on
Empirical Methods in Natural Language Processing, Association for Computational Linguistics,
Miami, Florida, USA, 2024, pp. 2348–2370. URL: https://aclanthology.org/2024.emnlp-main.140/.
doi:10.18653/v1/2024.emnlp-main.140.
[48] S. Bogdanov, A. Constantin, T. Bernard, B. Crabbé, E. Bernard, Nuner: Entity
recognition encoder pre-training via llm-annotated data, 2024. URL: https://arxiv.org/abs/2402.15343.
arXiv:2402.15343.
[49] N. Reimers, I. Gurevych, Making monolingual sentence embeddings multilingual using knowledge
distillation, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing, Association for Computational Linguistics, 2020. URL: https://arxiv.org/abs/2004.09813.
[50] A. V. Solatorio, Gistembed: Guided in-sample selection of training negatives for text
embedding fine-tuning, arXiv preprint arXiv:2402.16829 (2024). URL: https://arxiv.org/abs/2402.16829.
arXiv:2402.16829.
[51] M. Oliverio, P. F. Balestrucci, A. Mazzei, V. Basile, Dipinfo-unito at the gem'24 data-to-text task:
Augmenting llms with the split-generate-aggregate pipeline, INLG 2024 (2024) 59.
[52] Q. Team, Qwen2.5: A party of foundation models, 2024. URL: https://qwenlm.github.io/blog/qwen2.5/.
[53] C. Gardent, A. Shimorina, S. Narayan, L. Perez-Beltrachini, Creating training corpora for NLG
micro-planners, in: R. Barzilay, M. Kan (Eds.), Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4,
Volume 1: Long Papers, Association for Computational Linguistics, 2017, pp. 179–188. URL:
https://doi.org/10.18653/v1/P17-1017. doi:10.18653/v1/P17-1017.
[54] Z. Kasner, O. Dusek, Beyond traditional benchmarks: Analyzing behaviors of open LLMs on
data-to-text generation, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 12045–12072. URL: https:
//aclanthology.org/2024.acl-long.651/. doi:10.18653/v1/2024.acl-long.651.
[55] M. A. Stranisci, E. Bernasconi, V. Patti, S. Ferilli, M. Ceriani, R. Damiano, The world literature
knowledge graph, in: International Semantic Web Conference, Springer, 2023, pp. 435–452.
[56] R. Orlando, P.-L. H. Cabot, E. Barba, R. Navigli, Relik: Retrieve and link, fast and accurate entity
linking and relation extraction on an academic budget, 2025. URL: https://arxiv.org/abs/2408.00103.
arXiv:2408.00103.
[57] M. Boscariol, L. Bulla, L. Draetta, B. Fiumanò, E. Lenzi, L. Piano, Evaluation of llms on long-tail
entity linking in historical documents, arXiv preprint arXiv:2505.03473 (2025).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Huang, K. C.-C. Chang, Towards reasoning in large language models: A survey, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 1049–1065. URL: https://aclanthology.org/2023.findings-acl.67/. doi:10.18653/v1/2023.findings-acl.67.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] A. Abusitta, M. Q. Li, B. C. Fung, Survey on explainable ai: Techniques, challenges and open issues, Expert Systems with Applications 255 (2024) 124710.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] J. Kasai, K. Sakaguchi, R. Le Bras, A. Asai, X. Yu, D. Radev, N. A. Smith, Y. Choi, K. Inui, et al., Realtime qa: What's the answer right now?, Advances in neural information processing systems 36 (2023) 49025–49043.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] X. Li, S. Chan, X. Zhu, Y. Pei, Z. Ma, X. Liu, S. Shah, Are ChatGPT and GPT-4 general-purpose solvers for financial text analytics? a study on several typical tasks, in: M. Wang, I. Zitouni (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, Association for Computational Linguistics, Singapore, 2023, pp. 408–422. URL: https://aclanthology.org/2023.emnlp-industry.39/. doi:10.18653/v1/2023.emnlp-industry.39.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, N. K. Ahmed, Bias and fairness in large language models: A survey, Computational Linguistics 50 (2024) 1097–1179.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] M. H. Lee, J. M. Montgomery, C. K. Lai, Large language models portray socially subordinate groups as more homogeneous, consistent with a bias observed in humans, in: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024, pp. 1321–1340.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>