<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MARTIALIS: An Open Framework for Knowledge Graphs-based Retrieval Augmented Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Edoardo Bianchini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filippo Bianchini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Calamo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca De Luzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mattia Macrì</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimo Mecella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sapienza Università di Roma</institution>
          ,
          <addr-line>NESMOS and DIAG</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Université Grenoble Alpes</institution>
          ,
          <addr-line>Autonomie, Gérontologie, E-santé, Imagerie et Société - AGEIS</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>We present Martialis1, a framework designed to enhance the performance of Large Language Models (LLMs) in domain-specific tasks (e.g. medical, legal) by implementing an ontology-based data representation approach. We designed the framework to be almost transparent to the end user: once set up with domain-relevant documents and an ontology, it enables both complex reasoning and domain-specific text generation. This is made possible by our novel information extraction pipeline, which improves existing Retrieval Augmented Generation (RAG) techniques with a Domain-Specific Knowledge Graph - inferred from the documents - and a sanity check on the output - inferred from the ontology. This dual-layered approach ensures accuracy and relevance, addressing common limitations in existing solutions. The Martialis framework has been rigorously evaluated through collaboration with domain experts, comparing its performance against similar state-of-the-art systems. Results indicate improvements across key metrics and a speed-up in the efficiency of executing user tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Ontology-based Data Representation</kwd>
        <kwd>Retrieval Augmented Generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The development of Natural Language Processing (NLP) systems has rapidly advanced in recent years
with the rise of Large Language Models (LLMs), specifically generative ones, which have demonstrated
great capabilities in understanding and generating human-like text. These advancements have opened
new possibilities for addressing a variety of linguistic tasks. However, despite their potential,
generative LLMs face significant challenges, including aligning outputs with user intent and mitigating
hallucinations [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Such limitations can be attributed to structural aspects of LLMs: (i) the absence
of deep reasoning capabilities and (ii) the lack of inherent domain-specific knowledge. As task
complexity grows, these limitations become more impactful. Hence, LLMs require complementary
methodologies to address them. A promising approach to addressing these structural
limitations lies in the adoption of hybrid AI methodologies, a paradigm emphasizing the integration of
sub-symbolic AI techniques [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], such as LLMs, with symbolic AI ones [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], including Knowledge Graphs
(KGs). The two approaches are somewhat complementary: sub-symbolic methods are characterized
by their capacity, adaptability, and reliance on large datasets, while symbolic methods are
transparent, precise, and data-efficient. Furthermore, while symbolic approaches often require the manual
encoding of knowledge by human experts, sub-symbolic techniques rely on automatic learning from
data, thereby reducing the need for manual intervention. By combining the adaptive and data-driven
nature of sub-symbolic AI with the interpretability and logical rigor of symbolic AI, hybrid AI offers a
synergistic framework that addresses the deficiencies of each approach when used in isolation [
        <xref ref-type="bibr" rid="ref3 ref5">3, 5</xref>
        ].
      </p>
      <p>
        The introduction of ontology-based knowledge representation could pave the way for the development of
Intelligent Information Systems (IIS) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In this context, although several preliminary studies [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ] explore the potential benefits and
challenges of hybrid AI methodologies, few offer practical, end-to-end, and repeatable solutions for
developing Intelligent Information Systems in real-world scenarios. To address this gap, we focus on
the challenge of building end-to-end and repeatable IIS capable of performing linguistic tasks that
require domain-specific knowledge and reasoning abilities, which we define as complex linguistic
tasks. In our previous works [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], we presented the concept of an enhanced Retrieval Augmented
Generation (RAG) pipeline. This pipeline leverages KGs to supply LLMs with relevant information and
rules, effectively compensating for their lack of domain-specific knowledge and reasoning capabilities
in complex linguistic tasks. In this article, we present martialis1, our open-source and easily repeatable
framework realizing such an adaptable pipeline. martialis is an open framework designed to provide
users with accurate, precise answers or comprehensive, detailed text generation within domain-specific
contexts (e.g., legal, healthcare).
      </p>
      <p>To evaluate our framework, we collaborated with a team of medical experts from a hospital in Rome
to benchmark the performance in the healthcare domain. Specifically, we assessed its ability to answer
medical questions and generate various domain-specific documents derived from Electronic Clinical
Records (ECRs).</p>
      <p>The paper is structured as follows: in Section 2, we review the existing literature on the automatic or
semi-automatic construction of KGs from unstructured documents, as well as frameworks implementing
RAG pipelines and ontology-based compliance processes. In Section 3, we present the architecture of
martialis, along with a case study in the healthcare domain. Section 4 details the dataset composition
and the evaluation process used to assess our framework. Lastly, in Section 5, we discuss the limitations
and challenges of our framework and propose future directions for improvement.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In this section, we review existing works that directly relate to the core features and theoretical
foundation of Martialis: (i) Text2KG, for extracting structured knowledge from unstructured text to
create KGs in an automatic or semiautomatic way; (ii) KG RAG, namely methods that combine LLMs
with KGs for context-aware text generation; (iii) and Ontology-Augmented Text Generation approaches
that enhance generation by integrating ontologies for validation and consistency. Our objective is to
analyze existing frameworks that claim practical, implementable outcomes, emphasizing contributions
with concrete applications or open-source solutions. These works have been assessed against six
key criteria: code availability (open source), innovative methodology, adaptability (to multiple domains),
correctness, repeatability, and support for the generation (of complex and detailed responses from a KG).
Text to KG. The task of transforming unstructured text into a Knowledge Graph has always been
done by experts in the domain due to the high level of abstraction required to comprehend a text and
structure it into nodes and relationships. To automate this task, several attempts have been made
using various machine learning techniques [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. According to the latest survey [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] the results are
encouraging, since many frameworks have been created for extracting KG from unstructured text using
LLMs with little or –in some special cases– no human expertise, [
        <xref ref-type="bibr" rid="ref14 ref15 ref16 ref17">14, 15, 16, 17</xref>
        ].
      </p>
      <p>
        RAG with Knowledge Graphs. Hallucination is the most common and impactful limitation of
LLMs [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. RAG addresses these shortcomings by combining the strengths of LLMs with external
knowledge bases [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. RAG enables language models to access and process information from an external
source (e.g. documents, database) in real time, allowing them to generate more accurate, informative, and
up-to-date responses [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. While several techniques for traditional RAG have been proposed and tested,
there is still room for innovative approaches, such as incorporating KGs as external sources
(martialis’ code is available at: DIAG-Sapienza-BPM-Smart-Spaces/Martialis). Among the
most relevant, GLens [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] from Apple researchers and KRAGEN [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] are frameworks that integrate KGs
into RAG to efficiently retrieve and structure relevant information for precise query resolution. Some
other works examine RAG and KG primarily from a theoretical and visionary standpoint, lacking the
introduction of innovative approaches: in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] WeKnow-RAG, a framework that combines web search and Knowledge
Graphs, is proposed, introducing a domain-specific RAG system with a
multi-stage information retrieval logic and an LLM self-assessment system.
      </p>
      <p>
        Ontology Augmented Text Generation. Ontologies have always been the pillar for structuring
data and enabling reasoning. The advent of LLMs has drawn increasing interest in combining language
generation capabilities with the structured knowledge represented in ontologies [
        <xref ref-type="bibr" rid="ref16 ref23">16, 23</xref>
        ]. In particular,
we are interested in using LLMs for validating the structure of natural text against an ontology. We
envisioned a scoring mechanism on a Resource Description Framework KG automatically extracted
by the LLM from natural text and checked against the ontology, in a way inspired by the Shapes Constraint
Language (SHACL)2. To the best of our knowledge, no similar mechanism exists in the literature
that is fully implemented in a complete text generation framework. The work by [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] proposes a similar
way of validating semantic artifacts using ontologies and LLMs; however, they did not run tests on natural
text. Finally, the authors of [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] present a vision of a roadmap to improve the accuracy of LLMs with
ontologies for sanity checks.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Martialis’ Architecture</title>
      <p>martialis is a next-generation, open-source IIS designed to perform complex linguistic tasks in a
specific domain of knowledge, addressing a significant gap in the current literature: direct support for
advanced text generation. martialis (i) answers complex questions regarding the selected domain, even
if they require reasoning steps, and (ii) generates structured text for that domain.</p>
      <p>
        The complete architecture of the framework is presented in Figure 1. The design is modular,
comprising the following key components, each of which can be individually replaced with customized
implementations if necessary: Automatic KG-Extractor, Advanced Retrieval-Augmented-Generation,
and Ontology-Based Validator. The implementation leverages the Python libraries Llamaindex3 and
Langchain4, with GPT-4o [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] serving as the primary LLM.
      </p>
      <p>In the remainder of this section, we will detail the required inputs, elucidate the operational logic of
martialis using the healthcare domain as an illustrative example (also employed for validation), and
provide an in-depth analysis of martialis’ modules.</p>
      <sec id="sec-3-1">
        <title>3.1. Martialis’ Input</title>
        <p>
          After identifying the target knowledge domain, it is necessary to indicate to martialis, through a
configuration file, the source folder of the available documents. If the user wants to generate some
advanced domain-specific artifact, they also have to provide an ontology that describes that artifact.
Domain Documents. The documents inherent to the specified domain are the main pillar for
martialis. We have no constraints on the data format, as long as it contains meaningful text. In the
healthcare domain, we used several plain text and .pdf files that describe patient hospitalizations. The
documents are parsed and processed into a Vector Store [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], ready to be retrieved by our system.
Ontology. To support advanced text generation, martialis needs a blueprint to follow. This blueprint
is provided as an ontology for the type of text that the user wants to create. In our example domain,
we produced –guided by domain experts– one ontology for both clinical history and discharge letter
generation. We support both OWL and SHACL ontologies, as long as they model all pieces of information
about the abstractions that have to be present in the text (e.g. patient class, healthcare provider class).
There is also the option to automatically generate the ontology using tools like [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], but we have
obtained the best results with a human-crafted one.
2https://www.w3.org/TR/shacl/
3https://www.llamaindex.ai/
4https://www.langchain.com/
        </p>
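        <p>The setup just described can be sketched as a small configuration loader. The file layout and key names below are hypothetical: martialis’ actual configuration schema is not detailed in this paper.</p>
        <p>
```python
import json
from pathlib import Path

# Hypothetical configuration: key names are illustrative, not martialis' actual schema.
DEFAULT_CONFIG = {
    "documents_dir": "data/clinical_records",            # folder with domain documents
    "ontology_path": "ontologies/discharge_letter.ttl",  # OWL/SHACL blueprint (optional)
}

def load_config(path=None):
    """Load a martialis-style configuration file, falling back to defaults."""
    config = dict(DEFAULT_CONFIG)
    if path is not None and Path(path).exists():
        config.update(json.loads(Path(path).read_text()))
    # An ontology is only required for advanced text generation tasks.
    if "ontology_path" not in config:
        config["ontology_path"] = None
    return config

print(load_config()["documents_dir"])
```
        </p>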
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Preliminary Step pipeline</title>
        <p>
          As illustrated in Figure 2, when the user adds domain-specific documents or an ontology to the
martialis folder—either for the first time or incrementally—the framework automatically initiates
a preliminary processing step consisting of the following actions: (i) the documents are parsed and
stored in a Vector Database; (ii) a domain-specific Knowledge Graph is automatically generated from
the documents using the Automatic KG-Extractor module. This KG can then be queried to retrieve
precise and relevant information required for task completion; (iii) the ontology, which serves both
as a foundation for extracting the entities necessary for text generation tasks and as a reference for
correctness validation in the final output, is parsed and prepared for use within martialis.
KG-Automatic Extractor Module. The Domain KG must be an instance-level Knowledge Graph.
Consequently, unlike ontologies or schema-level knowledge, it is typically impractical to construct it
manually, even with the assistance of domain experts. Given that the objective of this work is to make
martialis’ real-world application feasible, we deemed it essential to address the need for automatically
building its own Domain KG. Since the majority of knowledge and information about organizations
remains stored in unstructured text documents, martialis automatically constructs the Domain KG
from the unstructured documents provided. To implement the domain KG extraction based on [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]
prompts, we decided to use the LlamaIndex Python library. Below, we provide real-world examples
of Knowledge Graphs generated using our KG-Automatic Extractor module. Figure 3 showcases the
extraction results from a single clinical record, visualized in Neo4j. On the left, the Knowledge Graph
represents the extraction from a single document, while on the right, the updated graph incorporates
an additional document. This incremental approach enables seamless integration of new documents
while retaining all previously extracted information, ensuring no data is lost during the process.
        </p>
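        <p>The incremental behaviour described above can be illustrated with a minimal, library-free sketch. In martialis the triples are produced by an LLM-based extractor built on LlamaIndex; here they are hard-coded for illustration.</p>
        <p>
```python
# Minimal sketch of incremental KG building: triples from each new document
# are merged into the existing graph, so nothing extracted earlier is lost.
def merge_document_triples(graph, new_triples):
    """Union the (subject, relation, object) triples from a new document into the Domain KG."""
    return graph | set(new_triples)

# Triples that, in martialis, would be produced by the LLM-based extractor.
doc1 = [("Patient_008", "TREATED_AT", "Neurology Unit"),
        ("Patient_008", "PRESCRIBED", "Artane")]
doc2 = [("Patient_008", "PRESCRIBED", "Sinemet"),
        ("Patient_008", "PRESCRIBED", "Artane")]  # duplicate, kept once

kg = merge_document_triples(set(), doc1)
kg = merge_document_triples(kg, doc2)
print(len(kg))  # 3 distinct triples: earlier facts survive, duplicates collapse
```
        </p>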
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Question answering and Text Generation pipeline</title>
        <p>martialis can understand if the user is asking a question about the domain or wants to generate some
advanced text. It routes the user prompt to the Advanced-Retrieval-Augmented-Generation module or
the Ontology-Based-Validator module, which we present in detail below.</p>
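        <p>The paper does not specify how the routing decision is made; a minimal keyword-based stand-in (the real framework plausibly delegates this classification to the LLM itself) might look like:</p>
        <p>
```python
def route(user_prompt):
    """Route a prompt to question answering or advanced text generation.
    Heuristic stand-in: martialis' actual routing logic is not described."""
    generation_cues = ("generate", "draft", "write", "summary", "letter")
    text = user_prompt.lower()
    if any(cue in text for cue in generation_cues):
        return "ontology_based_validator"  # advanced text generation pipeline
    return "advanced_rag"                  # question answering pipeline

print(route("Which medications were prescribed to Patient_008?"))  # advanced_rag
print(route("I need a detailed summary of the clinical history"))  # ontology_based_validator
```
        </p>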
      </sec>
      <sec id="sec-3-4">
        <title>Advanced-Retrieval-Augmented-Generation Module</title>
        <p>
          This module supports advanced question answering. It produces the answer by relying on a chain of thought [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] that includes
information extracted from the documents, structured with the entities and relationships of the Domain
KG, using text2cypher prompt engineering.
        </p>
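        <p>The text2cypher step can be sketched as prompt assembly; the schema serialization and wording below are illustrative, not martialis’ actual prompts:</p>
        <p>
```python
def build_text2cypher_prompt(question, node_labels, rel_types):
    """Assemble a text2cypher prompt: the LLM is asked to translate the user's
    question into a Cypher query over the Domain KG schema."""
    schema = ("Node labels: " + ", ".join(node_labels) + "\n"
              "Relationship types: " + ", ".join(rel_types))
    return ("Translate the question into a Cypher query over this graph schema.\n"
            + schema + "\n"
            + "Question: " + question + "\n"
            + "Cypher:")

prompt = build_text2cypher_prompt(
    "Which medications were prescribed to Patient_008?",
    node_labels=["Patient", "Medication"],
    rel_types=["PRESCRIBED"],
)
print(prompt)
```
        </p>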
        <p>The Domain KG enables the LLM to follow a stricter path in generating the final output, effectively
simulating an improvement in the LLM’s reasoning capabilities. Its function is carried out in the
augmented generation phase of the RAG pipeline and is iterative. In the standard RAG procedure, once
the enriched prompt is created, it is given to the LLM as input for generating a response. However, the
accuracy or relevance of the response is not verified using external tools. For complex language tasks,
producing the desired response may not be straightforward, which is why the Domain KG is used to
add a reasoning and abstraction layer. During the augmented generation phase, the specialized LLM
generates a preliminary output based on the enriched prompt. This output is bounded by the logic of
the Domain KG. An example from the healthcare domain would be:</p>
        <sec id="sec-3-4-1">
          <title>QA Task</title>
          <p>Q: "Which medications were prescribed to the patient Patient_008? Provide a comprehensive list"
A: Sinemet 100/25 (1 tablet 3 times daily: 7:30 AM, 12:30 PM, 5:30 PM), Zoloft 50 mg (1
tablet in the evening).</p>
          <p>Ontology-Based-Validator Module. This module is capable of generating text artifacts that require
deep comprehension of a specific knowledge domain. The final text output is validated against the
provided ontology. When this kind of request is detected, martialis uses prompt engineering to retrieve
from the ontology the relevant class and properties for the specific task (e.g. medical_condition,
treatment_outcome, patient_information). Those extracted objects are integrated into a chain of
prompts to generate the requested text. The text generated in this way is then passed to the validation
step: it is converted into a Resource Description Framework KG using the KG-Automatic Extractor
module and checked against the ontology. A score from 0 to 1 based on compliance with the ontology
is computed by checking the percentage of constraints on the class and the attributes that are present
or missing. In this way, martialis can show the user a numeric value for confidence in the generated
output.</p>
          <p>score(T) = ( ∑_{i=1}^{P} p_i(T) + ∑_{j=1}^{C} c_j(T) ) / ( P + C )   (1)
Where: T is the generated text, p_i(T) is 1 if property i is present in T, and 0 otherwise; c_j(T) is 1 if
class j is present in T, and 0 otherwise; P is the total number of properties; C is the total number of
classes. An example we generated with score 0.63 is:</p>
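          <p>The compliance score of Equation (1) can be implemented directly. In the sketch below the presence predicates are stubbed with set membership over a hand-written extraction result, whereas martialis derives them from the RDF KG extracted from the generated text; all entity names are illustrative.</p>
          <p>
```python
def ontology_compliance_score(found, required_properties, required_classes):
    """Equation (1): fraction of required ontology properties and classes
    found in the KG extracted from the generated text."""
    hits = sum(1 for p in required_properties if p in found) \
         + sum(1 for c in required_classes if c in found)
    return hits / (len(required_properties) + len(required_classes))

# Stubbed extraction result: in martialis, `found` would come from the RDF KG
# built by the KG-Automatic Extractor from the generated text.
found = {"patient_information", "medical_condition", "therapy",
         "has_birth_date", "has_treatment"}
score = ontology_compliance_score(
    found,
    required_properties=["has_birth_date", "has_treatment", "has_allergy"],
    required_classes=["patient_information", "medical_condition",
                      "therapy", "treatment_outcome", "admission"],
)
print(score)  # 5 matches out of 8 requirements -> 0.625
```
        </p>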
        </sec>
        <sec id="sec-3-4-2">
          <title>Text Generation Task</title>
          <p>Classes Extracted: patient_information, medical_condition, therapy, treatment_outcome [...]
Q: "I need a detailed summary of the clinical history for the patient patient_008"
A: The patient, patient_008, female, born on May 26, 1952, is being treated at the
Neurology Unit of Hospital. She initially presented with tremors and was prescribed Artane,
which she later reduced due to internal tingling and nausea. Subsequently, she was switched to
Sinemet, which led to a significant improvement in her tremors [...].</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>To validate our framework, we employed two types of evaluations, automatic and human, tailored
to the structure and objectives of our approach. For tasks focused on answering specific questions,
we utilized a single automatic evaluation approach. This involved assessing both the reliability of the
extracted context and the consistency of the generated responses with respect to the posed questions. For the text
generation tasks, automatic validation was conducted by comparing the knowledge graph derived from
the final answer against the ontology provided as the initial context for the task. Additionally, for the
text generation task, we performed a human evaluation. This involved developing and administering
detailed questionnaires to medical experts to assess the quality and domain-specific relevance of the
generated outputs.</p>
      <p>Dataset. The dataset used for our case study and testing consists of 32 clinical records5, which were
collected and selected by an expert medical team from a hospital in Rome. This selection was carried
out following approval from the hospital’s ethical committee to process the data strictly for research
purposes. The clinical records have been fully anonymized through PII masking with Named Entity
Recognition (NER) implemented through LlamaIndex. To maintain the integrity of the information, the
clinical records were left unprocessed in their original language, Italian. As a result, any information
produced by the model is also in Italian.</p>
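      <p>The anonymization step can be sketched as follows. In the actual pipeline the entity spans come from an NER model run through LlamaIndex; here they are supplied by hand, and the record text is invented for illustration.</p>
      <p>
```python
import re

def mask_pii(text, entities):
    """Replace detected PII surface forms with placeholder labels.
    `entities` is a list of (surface_text, label) pairs, as an NER model would yield."""
    for surface, label in entities:
        text = re.sub(re.escape(surface), "[" + label + "]", text)
    return text

record = "Mario Rossi was admitted on 12/03/2024 to the Neurology Unit."
masked = mask_pii(record, [("Mario Rossi", "PERSON")])
print(masked)  # [PERSON] was admitted on 12/03/2024 to the Neurology Unit.
```
      </p>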
      <p>Automatic Evaluation for QA task. We designed a test involving eight questions of increasing
complexity. These questions required the extraction and re-elaboration of various types of medical
information to produce accurate answers. Each of the eight questions was applied to all test documents
available in our dataset. In parallel, we collaborated with the team of domain experts to extract the
correct answers for each of the eight questions from the clinical records. These expert-verified answers
5Dataset for QA, clinical records, and all evaluation results are open and accessible at the following link:
Healthcare_Dataset_Martialis
served as the ground truth, allowing us to directly compare the outputs of our framework with the
expected responses. Finally, we selected and calculated four automatic metrics using the DeepEval
framework6. Selected metrics are the following ones: (i) Contextual Precision. It evaluates the quality of a
ranked list of retrieved nodes by considering both their relevance and their position in the ranking. The
metric is computed using the total number of relevant nodes in the retrieval context and the relevance
of each node; (ii) Contextual Recall. This metric assesses the quality of the RAG pipeline’s retriever by
measuring the extent to which the retrieved context aligns with the expected output; (iii) Faithfulness.
It evaluates whether the model generates factually correct information by comparing the actual output
of the LLM to the provided context; (iv) Answer Relevancy. This metric measures the quality of the
RAG pipeline’s generator by evaluating how relevant the actual output of the LLM is compared to the
provided input.</p>
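      <p>As an illustration of the first metric, a common rank-weighted formulation of contextual precision can be written as follows; this is a sketch of the metric’s usual definition, and DeepEval’s exact implementation may differ.</p>
      <p>
```python
def contextual_precision(relevance):
    """Rank-weighted precision of retrieved nodes: both relevance and position
    in the ranking matter. `relevance` is a list of booleans in retrieval order."""
    precisions, relevant_seen = [], 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            precisions.append(relevant_seen / k)  # precision at each relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant nodes ranked early score higher than the same nodes ranked late.
print(contextual_precision([True, True, False]))   # 1.0
print(contextual_precision([False, True, True]))   # (1/2 + 2/3) / 2
```
      </p>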
      <p>The results of this automatic evaluation were aggregated using two complementary approaches: (i)
Overall Performance Analysis; (ii) Performance by Question Complexity Analysis. The overall performance
analysis shows strong results across the metrics. The Contextual Precision (80.36%) and Contextual
Recall (83.07%) demonstrate the model’s ability to efectively understand and retrieve relevant
contextual information. The Faithfulness score (80.64%) indicates that the model’s responses are fairly
faithful to the original content. The Answer Relevancy score (88.84%) suggests that the answers are
highly relevant to the questions asked.</p>
      <p>As for the question complexity analysis, Contextual Precision shows the highest degree of variability, with
notably low values for questions 2 and 5. On the most complex questions (e.g. questions 6,
7, and 8) it is very precise in extracting context, reaching values above 90%. Contextual Recall
shows a similar situation: variability is high and, as with Contextual Precision, the
lowest values occur for questions 2 and 5. Again, the greatest values are concentrated in the last three questions,
exceeding the threshold of 90%. Furthermore, Faithfulness shows a different distribution: variability
is lower, and the smallest values are concentrated in questions 1 and 7. Question 8 still shows a strong
result. Finally, Answer Relevancy shows the best results and the lowest variability, as it is the only metric
with all values above the threshold of 75%. In particular, for question 1 we reached 100% answer
relevancy.</p>
      <p>Automatic Evaluation for Text Generation. We evaluated the model’s outputs in two distinct
scenarios: (i) generating the patient’s clinical history, and (ii) drafting the patient’s discharge letter. In
both cases, the initial stage of response generation involves extracting key entities and relationships
from the previously parsed ontology. Once the text-based response is generated, a knowledge
graph is constructed from the output using the KG-Automatic Extractor Module. This module identifies
the entities and relationships present in the generated text and maps them into a structured knowledge
graph format. The resulting graph is then compared against the existing ontology to assess its alignment
with the ground truth.</p>
      <p>
        The evaluation metric used was the score of matching elements (both entities and relationships)
between the extracted knowledge graph and the reference ontology presented in Section 3. The results,
reflecting the model’s ability to preserve the semantic integrity of the data during text generation,
show low values, but this outcome aligns with expectations. This evaluation is not absolute; rather, it
serves as a trigger to initiate the response revision process.
Human Evaluation for Text Generation. To carry out and report the human evaluation of the
text generation by martialis, we followed the guidelines in [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. First, we determined the objective:
to collect human feedback on the quality of the generated output. We intended this evaluation as an
intrinsically quantitative evaluation. Hence, we made subjective opinions about a document quantifiable.
We selected questions emphasizing the quality of the documents and transformed them into evaluation
criteria. As a rating scale, we opted for a 4-point Likert scale, widely used in NLG assessments, and
provided an interpretation of the scale before the questions. We recruited 18 medical domain experts,
including doctors, medical trainees, medical PhD students, and nurses.
6DeepEval is an open-source evaluation framework for LLMs.
      </p>
      <p>
        Generation Approaches Used. We compared the quality of draft documents written by a domain expert
with those generated by an LLM using different approaches. In particular, we used the martialis’
generation approach, a prompt guided by a domain ontology, and standard zero/few-shot prompts.
For each approach, the LLM used is GPT-4o [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] Using these four different document origins allowed
us (i) to evaluate the potential future benefits of LLMs as draft document writers by comparing the
quality of human-written and LLM-written documents, and (ii) to analyze the efficacy and efficiency of
martialis’ approach by comparing it with standard approaches. Questionnaire Structure. We generated
24 documents, 6 for each of the four generation approaches, divided into 12 clinical histories and 12
discharge letters. This allowed us to evaluate a more heterogeneous set of sentences while requiring
each participant to evaluate only a subset of the total sentences. Subsequently, the results from all
forms were combined and analyzed as a single dataset. For any document, participants rated seven text
features (Grammar, Medical Lexicon, Structural coherence, Logical Coherence, Ambiguity’s Absence,
Reliability, Human-like Writing). Overall, the results of the Human Evaluation of text generation show
that the highest-rated documents are the human-written ones (2.88), followed by those generated with
few-shot prompts, martialis-generated documents, and those generated with zero-shot (2.72). The
results fall in a restricted range but show a clear separation between the performance of human-written
and few-shot-generated documents, the martialis-generated documents, and zero-shot-generated
ones. Similar observations emerge when looking into single text features; human-written documents
outperform in four distinct features, the few-shot approach excels in two features, martialis’
documents excel in one, while the zero-shot approach never leads in any feature. Furthermore, focusing
only on the LLM-generated documents, the number of features in which the few-shot and martialis
approaches lead increases to four and three respectively, while the zero-shot approach stays at zero.
      </p>
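      <p>The combined-form analysis can be sketched as a simple aggregation over approaches, features, and raters on the 4-point Likert scale; the scores below are made up for illustration and are not the study’s data.</p>
      <p>
```python
from statistics import mean

# Hypothetical ratings: {approach: {feature: [1-4 Likert scores from experts]}}
ratings = {
    "human":     {"Grammar": [3, 4, 3], "Reliability": [3, 3, 4]},
    "zero_shot": {"Grammar": [3, 3, 2], "Reliability": [2, 3, 3]},
}

def approach_scores(ratings):
    """Average each approach over all features and all raters, as in the
    combined-form analysis (forms are pooled into a single dataset)."""
    return {approach: round(mean(s for scores in feats.values() for s in scores), 2)
            for approach, feats in ratings.items()}

print(approach_scores(ratings))
```
      </p>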
    </sec>
    <sec id="sec-5">
      <title>5. Discussion &amp; concluding remarks</title>
      <p>As far as the automatic evaluation is concerned, results are solid; the insights suggest that LLMs, if correctly guided,
might be truly useful in drafting semi-structured documents like those in our case study. This assumption
is reinforced by the human evaluation, where there is a clear gap in document quality between
the ones generated without guidance (zero-shot prompt) and the ones generated with guidance
(few-shot prompt and martialis). Furthermore, since the performances of martialis and few-shot are
complementary, it seems interesting to explore martialis’ RAG-on-KGs approach in synergy with
classical few-shot prompting techniques.</p>
      <p>We argue that martialis has the potential to stand out in completing complex linguistic tasks. We
intend to continue working in this direction by (i) releasing evolutionary updates to each module and
(ii) investigating, and eventually fine-tuning, open LLMs to replace GPT-4o, ensuring accessibility
and reproducibility.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The work of Mattia Macrì has been supported by the PhD fellowship Pubblica Amministrazione DM118
- CUP B83C22003460006. The work of Marco Calamo and Filippo Bianchini has been supported by
the Next-Generation EU programme (Italian PNRR - M4 C2, Invest 1.3 - D.D. 1551.11-10-2022), project
PE4 - MICS (Made in Italy - Circular and Sustainable).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4o-2024-08-06 for grammar and spelling
checking.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-P.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-N.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <article-title>LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples</article-title>
          ,
          <year>2023</year>
          . arXiv:2310.01469.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kankanhalli</surname>
          </string-name>
          ,
          <article-title>Hallucination is inevitable: An innate limitation of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2401.11817</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ilkou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koutraki</surname>
          </string-name>
          ,
          <article-title>Symbolic vs sub-symbolic ai methods: Friends or enemies?</article-title>
          ,
          <source>in: CIKM (Workshops)</source>
          , volume
          <volume>2699</volume>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Flasiński</surname>
          </string-name>
          ,
          <article-title>Symbolic artificial intelligence</article-title>
          ,
          <source>Introduction to Artificial Intelligence</source>
          (
          <year>2016</year>
          )
          <fpage>15</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Calegari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ciatto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Omicini</surname>
          </string-name>
          ,
          <article-title>On the integration of symbolic and sub-symbolic techniques for xai: A survey</article-title>
          ,
          <source>Intelligenza Artificiale</source>
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>7</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Evaluation of intelligent information system</article-title>
          ,
          <source>in: 2022 IEEE 22nd International Conference on Software Quality, Reliability, and Security Companion (QRS-C)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>183</fpage>
          -
          <lpage>188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <article-title>Survey and tutorial on hybrid human-artificial intelligence</article-title>
          ,
          <source>Tsinghua Science and Technology</source>
          <volume>28</volume>
          (
          <year>2022</year>
          )
          <fpage>486</fpage>
          -
          <lpage>499</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Correia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Grover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Pimentel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chaves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>De Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fonseca</surname>
          </string-name>
          ,
          <article-title>Designing for hybrid intelligence: A taxonomy and survey of crowd-machine interaction</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>2198</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Enhancing personalized search with ai: A hybrid approach integrating deep learning and cloud computing</article-title>
          ,
          <source>International Journal of Innovative Research in Computer Science &amp; Technology</source>
          <volume>12</volume>
          (
          <year>2024</year>
          )
          <fpage>127</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Calamo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>De Luzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Macrì</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mecella</surname>
          </string-name>
          ,
          <article-title>Enhancing Complex Linguistic Tasks Resolution through Fine-tuning LLMs, RAG and Knowledge Graphs</article-title>
          ,
          <source>in: International Conference on Advanced Information Systems Engineering</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Calamo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>De Luzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Macrì</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mecella</surname>
          </string-name>
          ,
          <article-title>A service-based pipeline for complex linguistic tasks adopting llms and knowledge graphs</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Aiello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Barzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dustdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Leymann</surname>
          </string-name>
          (Eds.),
          <source>Service-Oriented Computing</source>
          , Springer Nature Switzerland, Cham,
          <year>2025</year>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>161</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>A survey of knowledge graph construction using machine learning</article-title>
          .,
          <source>CMES-Computer Modeling in Engineering &amp; Sciences</source>
          <volume>139</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities</article-title>
          ,
          <source>World Wide Web</source>
          <volume>27</volume>
          (
          <year>2024</year>
          )
          <fpage>58</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lairgi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moncla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cazabet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Benabdeslem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cléau</surname>
          </string-name>
          ,
          <article-title>itext2kg: Incremental knowledge graphs construction using large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2409.03284</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Soh</surname>
          </string-name>
          ,
          <article-title>Extract, define, canonicalize: An llm-based framework for knowledge graph construction</article-title>
          ,
          <source>arXiv preprint arXiv:2404.03868</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Kommineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>König-Ries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Samuel</surname>
          </string-name>
          ,
          <article-title>From human experts to machines: An llm supported approach to ontology and knowledge graph construction</article-title>
          ,
          <source>arXiv preprint arXiv:2403.08345</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Docs2kg: Unified knowledge graph construction from heterogeneous documents assisted by large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2406.02962</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          et al.,
          <article-title>Retrieval-augmented generation for large language models: A survey</article-title>
          ,
          <year>2024</year>
          . arXiv:2312.10997.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jaitly</surname>
          </string-name>
          ,
          <article-title>Kglens: A parameterized knowledge graph solution to assess what an llm does and doesn't know</article-title>
          ,
          <source>arXiv preprint arXiv:2312.11539</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>N.</given-names>
            <surname>Matsumoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Moran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Venkatesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <article-title>KRAGEN: a knowledge graph-enhanced RAG framework for biomedical problem solving using large language models</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>40</volume>
          (
          <year>2024</year>
          )
          btae353. URL: https://doi.org/10.1093/bioinformatics/btae353. doi:10.1093/bioinformatics/btae353.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Weknow-rag: An adaptive approach for retrieval-augmented generation integrating web search and knowledge graphs</article-title>
          ,
          <source>arXiv preprint arXiv:2408.07611</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Allemang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sequeda</surname>
          </string-name>
          ,
          <article-title>Increasing the llm accuracy for question answering: Ontologies to the rescue!</article-title>
          ,
          <source>arXiv preprint arXiv:2405.11706</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tufek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saissre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <article-title>Validating semantic artifacts with large language models</article-title>
          ,
          <source>in: Proceedings of the 21st European Semantic Web Conference (ESWC)</source>
          , Crete, Greece,
          <year>2024</year>
          , pp.
          <fpage>24</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Monti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kutz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Righetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Troquard</surname>
          </string-name>
          ,
          <article-title>Improving the accuracy of black-box language models with ontologies: a preliminary roadmap</article-title>
          ,
          <source>in: Proceedings of the Joint Ontology Workshops</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hurst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Goucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ostrow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Welihinda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          , et al.,
          <article-title>Gpt-4o system card</article-title>
          ,
          <source>arXiv preprint arXiv:2410.21276</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Survey of vector database management systems</article-title>
          ,
          <source>The VLDB Journal</source>
          <volume>33</volume>
          (
          <year>2024</year>
          )
          <fpage>1591</fpage>
          -
          <lpage>1615</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>C. van der</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>van Miltenburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Krahmer</surname>
          </string-name>
          ,
          <article-title>Human evaluation of automatically generated text: Current trends and best practice guidelines</article-title>
          ,
          <source>Comput. Speech Lang</source>
          .
          <volume>67</volume>
          (
          <year>2021</year>
          )
          <fpage>101151</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>