<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Retrieval-Augmented Large Language Models: an Experimental Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claudia Diamantini</string-name>
          <email>c.diamantini@univpm.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alex Mircoli</string-name>
          <email>a.mircoli@univpm.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessia Pagnotta</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Potena</string-name>
          <email>d.potena@univpm.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Cristina Recchioni</string-name>
          <email>c.recchioni@univpm.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuele Storti</string-name>
          <email>e.storti@univpm.it</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>9</fpage>
      <lpage>11</lpage>
      <abstract>
<p>Large Language Models (LLMs) show impressive performance in many natural language processing (NLP) tasks, including code generation and question answering. However, they suffer from various limitations, such as the generation of hallucinated content and reliance on outdated internal knowledge. To address these challenges, Retrieval-Augmented Large Language Models (RA-LLMs) have emerged as a promising solution, enhancing response accuracy by dynamically incorporating external knowledge through retrieval-augmented generation (RAG). Despite the advantages of RA-LLMs, the implementation of effective retrieval-augmented pipelines remains a complex task, since various techniques exist for document chunking, text similarity evaluation, and model selection, each influencing the overall system performance. In this work, we propose a general methodology for designing a RA-LLM pipeline, outlining key approaches for each phase and evaluating the effectiveness of different configurations. We also perform an experimental evaluation of the pipeline's performance using real-world data, analyzing the impact of different techniques at each stage. The findings of this study offer insights into optimizing RA-LLM architectures for enhanced information retrieval and response generation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, Large Language Models (LLMs) have demonstrated impressive capabilities in
understanding and generating text [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In the field of natural language processing (NLP), they are currently
used in a wide range of applications, such as code generation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], conversational AI [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and emotion
recognition [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], due to their ability to model complex semantic relations. However, they still struggle
with certain limitations, including the tendency to produce hallucinated content [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and the reliance on
outdated internal knowledge. The latter problem is mainly a consequence of the huge training costs
of these models, which limit the possibility of updating the model with fresh data. In this context,
Retrieval-Augmented Large Language Models (RA-LLMs) have gained popularity due to their ability to
generate accurate responses by considering information extracted from external knowledge sources
through retrieval-augmented generation (RAG) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This approach enables LLMs to access up-to-date
and contextually relevant information, thus enhancing the accuracy and reliability of their responses
by integrating internal knowledge with dynamically retrieved data.
      </p>
      <p>
        Retrieval-augmented generation can be performed in several different ways, i.e., by defining different
pipelines or by using different techniques to implement a step of the same pipeline. For example, many
techniques have been proposed in the literature to split documents into smaller and more manageable
text chunks. Likewise, various approaches for evaluating text similarity during the retrieval phase can be
adopted. Moreover, many LLMs are available online, each with different characteristics and performance.
As a consequence, effectively combining the existing approaches and tools can be challenging. However,
the appropriate choice of techniques to be used at each stage of the pipeline could significantly increase
the performance of the system in terms of relevance of the retrieved documents and quality of the
generated response, as demonstrated by several works (e.g., [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). The present work aims to propose a
general methodology for RA-LLM and outline the possible approaches that can be used for each step of
the pipeline, as well as to evaluate the effectiveness of different combinations of techniques.
      </p>
      <p>The main contributions of the paper are:
• the definition of a pipeline for RA-LLM to generate relevant and complete answers to questions
asked by users by retrieving information from an external knowledge base;
• the experimental evaluation of the pipeline's performance on real-world data, assessing, for each
phase of the pipeline, the impact of a specific technique chosen among those most commonly
adopted in the literature.</p>
      <p>The rest of the paper is structured as follows: the next Section presents some relevant related works on
Large Language Models and Retrieval-Augmented Generation. The methodology for the creation of the
RA-LLM is proposed in Section 3, while Section 4 discusses the results of an experimental evaluation of
the three phases of the methodology. Finally, Section 5 draws conclusions and discusses future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP) by
demonstrating remarkable capabilities in language understanding and text generation. The training
process of an LLM involves exposing the model to vast amounts of textual data, enabling it to learn
patterns, semantics, and contextual relationships. However, the paradigm presents key challenges:
(1) once trained, the model’s knowledge remains static, reflecting only the data available up to that
point, and (2) the LLMs are prone to hallucinations [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] on domain-specific queries, since training data
primarily consist of general-purpose texts. Hence, given the high computational cost of training
LLMs from scratch, or retraining them with new information, integrating custom knowledge into these
models is a nontrivial challenge. Several strategies have been explored in the literature to address
these issues, such as fine-tuning, i.e., training the LLM on a task-specific dataset to refine its knowledge
[
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref9">9, 10, 11, 12</xref>
        ], prompt engineering, i.e., supporting the model by providing a structured prompt with
detailed instructions for the task [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
        ], and Retrieval-Augmented Generation (RAG), i.e., enriching
the prompt with information extracted from external knowledge bases [16]. In particular, RAG is
emerging as a leading solution, allowing for updates to the LLM’s knowledge base without the need to
re-train the model. This is achieved by dynamically querying external knowledge sources, extracting
relevant domain-specific information, and incorporating it into the model’s output. Simpler approaches
to RAG consist of a pipeline starting with an indexing phase, in which documents are processed,
segmented, and embedded for storage in vector databases. A subsequent stage performs retrieval, which
can be sparse, i.e., word-based and mostly applied in text retrieval, or dense, embedding queries and
external knowledge into vector spaces. The pipeline concludes with a generation stage [17].
      </p>
      <p>Within the taxonomy of foundational RAG approaches proposed in [18], query-based RAGs are among
the most widely studied in the literature. These models aim to integrate the user’s query with retrieved
information, which is then fed to the generator as input. As such, the combined
content, including the original query and the retrieved information, is processed by the generator as a
unified input, allowing it to produce a contextually richer response. For instance, REALM [19] selects the
top-k most relevant article snippets based on the query and passes each snippet along with the question
to the LLMs to generate k responses. These responses are then merged to produce the final answer.</p>
      <p>Lewis et al. [16] employed pre-trained parametric and non-parametric memory for language
generation, using BART as the generator to significantly improve the generation process. In a similar way,
In-Context RALM [20] leverages BM25 for document retrieval and trains a predictive reranker
aimed at reordering and integrating the top-ranked documents. In some cases, the retrieval process
itself can be considered optional, as in [21], which tailors retrieval behavior to diverse task requirements
through self-reflection.</p>
      <p>Beyond the query-answering RAGs, alternative RAG strategies include incorporating retrieved
information into generative models as latent representations (e.g., FID [22]), or generative models
integrating retrieval information through logits during the decoding process (e.g., kNN-LM [23]).
Additionally, a further approach applicable to data sequences allows the use of retrieved information
directly as responses, instead of relying only on generation, aiming to save resources and accelerate
response time (e.g., REST [24]).</p>
      <p>Since the performance of classical RAG architectures, in terms of relevance, precision, and recall, varies
considerably, more recent approaches introduce specific improvements focusing on enhancing retrieval
quality and reducing latency through pre-retrieval and post-retrieval strategies. Among the former,
improvements include query rewriting and transformation (e.g., [25]), while reranking, summarization,
and fusion are among the latter. Aspects that can be optimized during the indexing phase include the
chunking strategy, possible enrichment of chunks with metadata, and definition of proper indexing
structures. Chunk optimization methods are employed by [26], in which text chunks are broken down
into finer atomic statements to achieve higher recall and improved results, while in RAPTOR [27], a tree
structure is produced by recursive embedding, clustering, and summarization of text chunks to address
the lack of contextual information. Since redundant information can interfere with the final generation
of the LLM, the retrieved content is typically processed. Among others, solutions include reranking, i.e.,
ordering the retrieved documents to highlight the most relevant ones [28], or compression of the context
(e.g., [29]). On the other hand, alternative and effective methods to enhance the retrieval phase include
Knowledge Graphs, where entities are connected by relations, serving as a pre-built index for retrieval
[30].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this section, we describe the proposed methodology for building Retrieval-Augmented Large Language
Models. The methodology is depicted in Figure 1 and consists of three main phases: Indexing, Retrieval
&amp; Reranking, and Text Generation. A detailed description of each methodological step is presented in
the following subsections, along with the discussion of the main issues of each phase.</p>
      <sec id="sec-3-1">
        <title>3.1. Indexing</title>
        <p>The goal of the Indexing phase is to process and store data belonging to an external knowledge base so
that they are optimized for efficient data retrieval during the following phase (i.e., Retrieval &amp; Reranking).
The input of this phase is a set of text documents which contain information relating to the domain
of interest, while the output is a set of sentence embeddings, along with metadata, stored in a vector
database. The main steps of the Indexing phase are:
• chunking: texts are split into smaller chunks in order to store them in a more efficient way and
allow searches to be performed at a higher level of granularity;
• embedding generation: text chunks are transformed into vector embeddings;
• storage: vector embeddings are stored and indexed into a vector database, which ensures optimized
access to this type of data (a storage sketch is shown after this list).</p>
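        <p>As an illustration of the storage step, the following sketch indexes pre-computed chunk embeddings in a vector database. The choice of chromadb and all identifiers here are assumptions for illustration; the paper does not prescribe a specific engine.</p>
        <preformat>
# Illustrative sketch of the storage step; chromadb is an assumed choice
# of vector database, not the one mandated by the paper.
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="knowledge_base")

def store_chunks(chunks, embeddings):
    """Index pre-computed chunk embeddings for later similarity search."""
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
    )

# The Retrieval phase later queries the same collection, e.g.:
# collection.query(query_embeddings=[query_vector], n_results=5)
        </preformat>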
        <p>Among the existing approaches for chunking, the two most popular ones - which will be evaluated
in Section 4 - are the recursive character chunker (RCC) and the semantic chunker (SC). The RCC
iteratively splits text into chunks, going through different levels of delimiters until it reaches a division
that complies with the established criteria. In practice, the chunker divides the text into segments
of a predetermined size, known as chunk size, with the help of delimiter characters. Although these
characters can be customized, the default ones are ["\n\n", "\n", " "], while the chunk size is determined
by a number of characters chosen according to specific needs. The process starts by looking for the first
delimiter (e.g., "\n\n"), which indicates the end of a paragraph. If the text between two "\n\n" is too long
to fit within the chunk size limit, the system does not break the text, but moves on to the next delimiter,
namely "\n", which separates lines. If this division also does not respect the size limit, it tries again
with the next delimiter (" " for words) until a segmentation that respects the size constraint is obtained.</p>
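        <p>The following minimal sketch illustrates the recursive splitting policy just described; function and parameter names are illustrative assumptions (production pipelines typically rely on a library implementation, such as LangChain's RecursiveCharacterTextSplitter).</p>
        <preformat>
# Didactic sketch of the recursive character chunker (RCC): try the
# coarsest delimiter first and descend to finer ones for oversized parts.
# Delimiters themselves are dropped in this simplified version.
def recursive_chunk(text, chunk_size, delimiters=("\n\n", "\n", " ")):
    if len(text) &lt;= chunk_size or not delimiters:
        return [text]
    head, rest = delimiters[0], delimiters[1:]
    chunks = []
    for part in text.split(head):
        if len(part) &lt;= chunk_size:
            chunks.append(part)
        else:
            # Part still too long: retry with the next, finer delimiter.
            chunks.extend(recursive_chunk(part, chunk_size, rest))
    return chunks
        </preformat>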
        <p>The SC is based on the semantic similarity between groups of sentences, in order to create chunks
that preserve the semantics of the text. It is particularly advantageous in the analysis of complex, long
documents, such as technical manuals, scientific papers and reports, where a simple split based on
static delimiters could compromise the overall understanding of the text. The SC splits the document
into groups of three sentences, combining each sentence with the previous and the next one through
an overlap mechanism. The groups of three sentences are then processed sequentially to generate
vector embeddings representing each group. At this point, the cosine similarity between each current
embedding and the next one is calculated and the groups of sentences are split when the cosine similarity
is less than a given threshold.</p>
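        <p>A minimal sketch of this splitting logic is shown below; embed() stands for any sentence-embedding model, and the threshold value is an assumption, since the paper does not fix one.</p>
        <preformat>
# Sketch of the semantic chunker (SC): overlapping three-sentence windows
# are embedded, and a new chunk starts where consecutive windows diverge.
import numpy as np

def semantic_chunk(sentences, embed, threshold=0.8):
    if not sentences:
        return []
    # Each window combines a sentence with its previous and next neighbour.
    windows = [" ".join(sentences[max(i - 1, 0):i + 2])
               for i in range(len(sentences))]
    vecs = [np.asarray(embed(w), dtype=float) for w in windows]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        cos = vecs[i - 1] @ vecs[i] / (
            np.linalg.norm(vecs[i - 1]) * np.linalg.norm(vecs[i]))
        if cos &lt; threshold:  # semantic break detected: close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
        </preformat>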
        <p>As for vector embeddings, they are generated to support the Retrieval phase, allowing
a search based on semantic similarity between the considered texts (i.e., the user prompt and the
documents in the knowledge base). Embeddings are usually generated using pre-trained Transformer
models, such as Google BERT or OpenAI text embedding models. In the following, embeddings will be
generated through the OpenAI API (https://platform.openai.com/docs/guides/embeddings).</p>
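        <p>A minimal sketch of the embedding step through the OpenAI API is shown below; it assumes the OPENAI_API_KEY environment variable is set, and the model name can be replaced with text-embedding-3-large.</p>
        <preformat>
# Embedding generation via the OpenAI embeddings endpoint.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(chunks, model="text-embedding-3-small"):
    response = client.embeddings.create(model=model, input=chunks)
    return [item.embedding for item in response.data]
        </preformat>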
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Retrieval &amp; Reranking</title>
        <p>The Retrieval &amp; Reranking phase aims at finding the most relevant set of texts from the knowledge
base stored in the vector database. In this phase, user prompts are transformed into vector embeddings
and are then compared to those in the vector database. Several different strategies can be used to
determine the most relevant texts to be retrieved; among them, the most widely adopted are: semantic
search, cosine similarity, and hybrid search. The first approach evaluates the similarity between two
text embeddings by means of the Euclidean distance, while the second approach exploits the cosine
similarity. The hybrid search combines keyword search and semantic search. For keyword search, it
uses the Okapi BM25 ranking function, which is defined as:</p>
        <p>\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avg\_doclength}}\right)} \quad (1)</p>
        <p>where, for a query Q = {q_1, ..., q_n} and a document D:
• IDF(q_i) is the inverse document frequency of query term q_i
• f(q_i, D) is the frequency of term q_i in document D
• |D| is the length of the document D
• avg_doclength is the average document length in the corpus
• k_1 and b are free parameters of the ranking function</p>
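        <p>A direct, didactic transcription of Equation (1) is sketched below; documents are assumed to be pre-tokenized lists of terms, and the IDF smoothing follows the standard Okapi formulation (an optimized library implementation would be used in practice).</p>
        <preformat>
# Didactic BM25 scorer following Equation (1); corpus and doc are lists
# of tokens, k1 and b are the usual free parameters.
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    N = len(corpus)
    avg_doclength = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed Okapi IDF
        f = tf[term]
        score += idf * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc) / avg_doclength))
    return score
        </preformat>
        <p>In a hybrid search, such keyword scores are then fused with the semantic similarity scores of the dense retriever (e.g., by a weighted sum after score normalization).</p>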
        <p>The results can also be improved by coupling one of those retrieval strategies with a reranker. In fact,
in many cases the initial retrieval phase may not identify the documents most relevant to the user's
question. In these situations, the reranker makes it possible to reorganize the selected candidates and
filter out less relevant texts by applying further selection criteria.</p>
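        <p>One common way to implement such a reranker is a cross-encoder that jointly scores each (query, passage) pair, as in the following sketch; the specific model checkpoint is an illustrative assumption, not the configuration used in the experiments.</p>
        <preformat>
# Reranking retrieved candidates with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=5):
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores),
                    key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
        </preformat>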
        <sec id="sec-3-2-1">
          <title>1https://platform.openai.com/docs/guides/embeddings</title>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Text Generation</title>
        <p>The goal of the last phase (Text Generation) is to generate a response expressed in natural language. The
response is produced by giving as input to an LLM both the user prompt and the retrieved documents,
which are used to extend the internal knowledge of the LLM through few-shot learning. Given the
same number of documents provided as input, the quality of the output of this phase depends on the
performance of the chosen LLM.</p>
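        <p>The following sketch illustrates this step with the OpenAI chat API; the prompt wording is an assumption, as the paper does not report its exact prompt template.</p>
        <preformat>
# Generation step: the retrieved chunks are injected into the prompt
# as context for the LLM.
from openai import OpenAI

client = OpenAI()

def generate_answer(question, retrieved_chunks, model="gpt-4o-mini"):
    context = "\n\n".join(retrieved_chunks)
    messages = [
        {"role": "system",
         "content": "Answer the question using only the provided context."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
        </preformat>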
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>This section presents an empirical analysis designed to evaluate the performance of the RAG-based
system. The evaluations are conducted on a knowledge base consisting of English documents focused
on Italian cuisine and nutrition. The section describes the generation process of queries and expected
outputs, the evaluation metrics used to compare the three phases of the tested methodology, and a
critical discussion of the obtained results.</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental setup</title>
        <p>4.1.1. Knowledge base
To assess the performance of the RA-LLM system, raw data in English are sourced from various online
platforms, including Wikipedia (https://it.wikipedia.org) and arXiv (https://arxiv.org/). The text is
extracted from the gathered documents, available in formats such as PDF and HTML, to build a dataset
of 74 entries with an average length of 9,240 characters. The data focus mainly on Italian cuisine,
including recipes from different Italian regions and the history of local culinary traditions. In addition,
several scientific papers focus on nutrition issues, with a special emphasis on elderly nutrition and the
Mediterranean diet.
4.1.2. Generation of queries and expected outputs
For a comprehensive evaluation of the system's ability to respond to users' prompts, 320 questions and
answers are generated with the support of the GPT-4o-mini language model. These are organized into
three levels of complexity:
• Simple Questions (100): aim to verify the knowledge of basic facts explicitly present in the
documents.
• Intermediate Questions (102): require contextual understanding and the integration of multiple
pieces of information.</p>
        <p>• Multi-Document Questions (112): require in-depth analysis, synthesis skills, and critical thinking.
An excerpt of the questions generated by the LLM is shown in Table 1.</p>
        <p>The questions are used to test the RAG processes and therefore must be designed to challenge the system
effectively. For questions classified as "simple" and "intermediate", the process begins by generating 5
questions for shorter documents and 10 for longer ones. Then a manual evaluation is performed to select
the most appropriate questions to achieve the desired number of questions and answers. Specifically,
the final selection associates a single question with each short document and multiple questions with
each long document. This approach ensures a balanced distribution of questions based on the length
and complexity of the documents.</p>
        <p>Finally, "multi-document" questions are generated by using multiple texts as reference context.
Groups of three documents are selected sequentially, and the model is required to create a question
that involves at least two documents in the group, preferably all three. For each triad, six questions
are initially generated and then manually reviewed to identify the best ones. It is essential that each</p>
        <sec id="sec-4-1-1">
          <title>2https://it.wikipedia.org 3https://arxiv.org/</title>
          <p>4.1.3. Performance metrics
In this analysis, we evaluate the results through the approach offered by the deepeval library
(https://github.com/confident-ai/deepeval), which allows advanced metrics calculation using different
LLMs. Therefore, all the evaluation metrics used are based on the LLM-as-a-judge paradigm. In our
case, we use the OpenAI GPT-4o-mini model with a temperature value of zero.</p>
          <p>The adopted approach allows examining the system's performance in terms of contextual accuracy
and relevance of responses. In particular, the considered metrics are:
1. Contextual Precision
2. Contextual Recall
3. Contextual Relevancy
4. Answer Relevancy
5. Faithfulness</p>
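          <p>As an illustration, the following sketch computes one of these metrics with deepeval; the test strings are invented examples, while class and argument names follow the library's public API.</p>
          <preformat>
# Computing an LLM-as-a-judge metric with deepeval.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

metric = AnswerRelevancyMetric(model="gpt-4o-mini")
test_case = LLMTestCase(
    input="What are the typical ingredients of pesto alla genovese?",
    actual_output="Basil, pine nuts, garlic, olive oil and Parmesan cheese.",
    retrieval_context=["Pesto alla genovese is a sauce made of basil, ..."],
)
metric.measure(test_case)
print(metric.score, metric.reason)
          </preformat>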
          <p>The Contextual Precision measures how well the system prioritizes the most useful documents for a
specific query. This metric is calculated as follows:</p>
          <p>\mathrm{CP} = \frac{1}{\text{Number of Relevant Nodes}} \sum_{k=1}^{n} \left( \frac{\text{Number of Relevant Nodes up to Position } k}{k} \times r_k \right)</p>
          <p>where:
• n is the length of the retrieval context (the number of retrieved chunks)
• r_k is the binary relevance of the k-th chunk in the retrieval context. A value of r_k = 1 is assigned
to relevant chunks, 0 otherwise.</p>
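          <p>For clarity, the formula can be computed directly from the binary relevance labels of the retrieved chunks, as in the following sketch.</p>
          <preformat>
# Direct computation of Contextual Precision from binary relevance flags,
# listed in retrieval (ranking) order.
def contextual_precision(relevance):
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score, seen = 0.0, 0
    for k, r in enumerate(relevance, start=1):
        seen += r                 # relevant nodes up to position k
        score += (seen / k) * r
    return score / total_relevant

# Relevant chunks at ranks 1 and 3: (1/1 + 2/3) / 2 = 0.833...
assert round(contextual_precision([1, 0, 1]), 3) == 0.833
          </preformat>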
          <p>The Contextual Recall evaluates the quality of the retrieval pipeline by measuring how much the retrieval
context is aligned to the expected output. The indicator is defined by the following equation:</p>
          <p>\mathrm{Contextual\ Recall} = \frac{\text{Number of Attributable Statements}}{\text{Total Number of Statements}}</p>
          <p>Specifically, the metric is determined using a language model to extract all statements in the expected
output. The model then verifies how many of these claims can be assigned to the chunks present in the
retrieval context, thus identifying the number of "attributable statements".</p>
          <p>The Contextual Relevancy measures the quality of the retriever by evaluating the overall relevance of
the information present in the retrieval context for a given input. The Contextual Relevancy score is
determined by the following equation, using an approach similar to the previous metric:</p>
          <p>\mathrm{Contextual\ Relevancy} = \frac{\text{Number of Relevant Statements}}{\text{Total Number of Statements}}</p>
          <p>The Answer Relevancy evaluates the quality of the response generation phase by measuring the degree
of relevance of the actual output versus the input provided. The value of the Answer Relevancy is
determined by the following equation:</p>
          <p>\mathrm{Answer\ Relevancy} = \frac{\text{Number of Relevant Statements}}{\text{Total Number of Statements}}</p>
          <p>First, the calculation involves using a language model to extract statements from the actual output.
Then, the model evaluates the relevance of each statement to the input, identifying the total number of
"relevant statements".</p>
          <p>Finally, the Faithfulness metric verifies how consistent the actual output is with the contents of the
retrieval context. The value of the metric is defined by the following equation:</p>
          <p>\mathrm{Faithfulness} = \frac{\text{Number of Truthful Claims}}{\text{Total Number of Claims}}</p>
          <p>As with the other metrics, a language model is used to extract all the statements in the actual output.
Subsequently, the same model classifies an assertion as a "truthful claim" only if it is consistent with the
retrieval context.</p>
          <p>It is important to note that all three stages of the methodology, i.e., indexing, retrieval, and generation,
will be tested. For the latter, the system is tested using the OpenAI GPT-4o-mini generative model
(https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) and some open source
alternatives, including Mistral-7B-Instruct-v0.2 (https://mistral.ai/news/announcing-mistral-7b/),
Mixtral-8x7B-Instruct-v0.1 (https://mistral.ai/news/mixtral-of-experts/), and Llama-3.1-70B-Instruct
(https://www.llama.com/). In this way, it is also possible to assess the capabilities of open-source
generative models.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>This section presents the results obtained during the evaluation of the proposed system based on
RA-LLM. The main objective is to evaluate different variants with the purpose of determining the most
effective approach in terms of generated responses. The outputs of the approaches are available at the
following link: https://anonymous.4open.science/r/RAG_Text-CE8E.
4.2.1. Evaluation of the Indexing phase
For the indexing phase, three approaches are compared, namely the Naïve RAG, the Semantic Chunker
and the OpenAI Large Embedding RAG. The characteristics of the three systems are shown in Table 2.
The Naïve RAG implements the simplest approaches both for chunking and embedding and hence it is
used as a baseline. The Semantic Chunker and the OpenAI Large Embedding RAG, instead, use more
advanced techniques for, respectively, chunking and embedding.</p>
        <sec id="sec-4-2-1">
          <title>5https://openai.com/index/gpt-4o-mini-advancing-cost-eficient-intelligence/ 6https://mistral.ai/news/announcing-mistral-7b/ 7https://mistral.ai/news/mixtral-of-experts/ 8https://www.llama.com/</title>
          <p>As regards the indexing phase, the comparison between the text-embedding-3-large model and the
text-embedding-3-small model is particularly interesting, as it highlights improvements in key metrics
(e.g., AnswerRelevancyScore) in favor of the large model. In fact, text-embedding-3-large allows better
prioritization of the most relevant documents compared to the less significant ones, providing greater
completeness in retrieving the relevant documents and giving responses more aligned with the initial
input. In general, as also shown in the following comparisons, the adoption of text-embedding-3-large
leads to better performance.
4.2.2. Evaluation of the Retrieval phase
For what concerns the Retrieval phase, the results of the experiments are summarized in Table 4,
highlighting the differences between the Naïve RAG, which uses the semantic search for the Retrieval
phase, and the tested variants, namely the Cosine Similarity RAG, the Reranker RAG, and the Hybrid
Search RAG.</p>
          <p>The best approaches for the retrieval phase seem to be the Reranker and the Hybrid Search. In
fact, the latter shows the highest Contextual Relevancy Score, equal to 0.292. This result is consistent
with expectations, since the technique adopted excludes, among the selected chunks, those not strictly
relevant to the query. This approach also allows a significant increase in the Answer Relevancy (+4.07%).
However, there is a small reduction in the Faithfulness (-1.36%), which is the only metric where a
reduction in performance is noted compared to the baseline.</p>
          <p>It should also be noted that the use of cosine distance, compared to Euclidean distance, does not result
in significant differences in the overall system performance.
4.2.3. Evaluation of the Generation phase
Regarding the generative phase, we carried out experiments on both GPT-4o-mini and open source
models, namely Mistral 7B-Instruct-v0.2, Mixtral-8x7B-Instruct-v0.1 and Llama-3.1-70B-Instruct. Table
5 shows the average values of the metrics calculated for each considered model.</p>
          <p>The results suggest that GPT-4o-mini outperforms the open source models. In particular, the model
shows superior performance in identifying relevant information and generating complete, accurate
answers. In contrast, the other models analyzed struggle with understanding the provided context or
responding to complex queries, often returning incomplete or null responses. Furthermore, while it
was expected that Mistral 7B would perform the worst among open-source models, as it is the model
with the lowest number of parameters, it was actually Mixtral 8x7B that recorded the lowest results in
terms of response relevance to queries. In general, Llama 70B shows better performance than the Mistral
models, which is understandable considering that it is the open source model with the highest number
of parameters among those tested.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we defined a pipeline for developing a question answering system based on
Retrieval-Augmented Large Language Models and we experimentally evaluated several techniques for each
of its building blocks. The analysis was conducted by using a real-world dataset on Italian food as
the knowledge base and generating a set of questions at various levels of complexity. The generated
responses were evaluated from different perspectives by means of five metrics. For what concerns the
Indexing phase, the experiments suggest that the use of a semantic chunker and a large embedding
model leads to better results in terms of precision, recall and relevancy. As regards the Retrieval phase,
the use of a reranker or the hybrid search seems to improve the relevancy of the generated responses,
thanks to better retrieval of relevant information from the knowledge base. Finally, as regards the
generation phase, the considered commercial LLM outperforms open source LLMs, in particular in
terms of answer relevancy and faithfulness.</p>
      <p>In future work, we plan to extend the experimentation by considering a larger set of documents in
the knowledge base, from which it will be possible to define a greater number of questions. Moreover,
we plan to generate the new dataset using a distinct LLM since questions and related answers are
currently generated by one of the LLMs that are also evaluated in Section 4.2, which may potentially
favor its performance by generating questions more similar to its own reasoning patterns, phrasing,
or knowledge scope. We are also interested in evaluating other commercial and open-source LLMs,
such as Anthropic Claude and DeepSeek [31]. Finally, we want to investigate the impact of some
key parameters, such as chunk size and chunk overlap across different source types, since their
optimization could potentially increase the performance of the Indexing and Retrieval phases.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Alex Mircoli and Maria Cristina Recchioni have received funding from the project Vitality – Project
Code ECS00000041, CUP I33C22001330007 - funded under the National Recovery and Resilience Plan
(NRRP), Mission 4 Component 2 Investment 1.5 - ’Creation and strengthening of innovation ecosystems,’
construction of ’territorial leaders in R&amp;D’ – Innovation Ecosystems - Project ’Innovation, digitalization
and sustainability for the diffused economy in Central Italy – VITALITY' Call for tender No. 3277 of
30/12/2021, and Concession Decree No. 0001057.23-06-2022 of Italian Ministry of University funded by
the European Union – NextGenerationEU.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>No GenAI tool was used during the preparation of this work.</p>
    </sec>
    <sec id="sec-8">
      <title>References [16-31]</title>
      <p>[16] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances
in Neural Information Processing Systems 33 (2020) 9459–9474.</p>
      <p>[17] W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, Q. Li, A survey on rag meeting llms:
Towards retrieval-augmented large language models, in: Proceedings of the 30th ACM SIGKDD
Conference on Knowledge Discovery and Data Mining, KDD '24, Association for Computing
Machinery, New York, NY, USA, 2024, pp. 6491–6501. URL: https://doi.org/10.1145/3637528.3671470.
doi:10.1145/3637528.3671470.</p>
      <p>[18] P. Zhao, H. Zhang, Q. Yu, Z. Wang, Y. Geng, F. Fu, L. Yang, W. Zhang, B. Cui, Retrieval-augmented
generation for ai-generated content: A survey, arXiv preprint arXiv:2402.19473 (2024).</p>
      <p>[19] K. Guu, K. Lee, Z. Tung, P. Pasupat, M.-W. Chang, Realm: retrieval-augmented language model
pre-training, in: Proceedings of the 37th International Conference on Machine Learning, 2020, pp.
3929–3938.</p>
      <p>[20] O. Ram, Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-Brown, Y. Shoham,
In-context retrieval-augmented language models, Transactions of the Association for Computational
Linguistics 11 (2023) 1316–1331.</p>
      <p>[21] A. Asai, Z. Wu, Y. Wang, A. Sil, H. Hajishirzi, Self-rag: Learning to retrieve, generate, and critique
through self-reflection, in: The Twelfth International Conference on Learning Representations,
2023.</p>
      <p>[22] G. Izacard, E. Grave, Leveraging passage retrieval with generative models for open domain
question answering, in: EACL 2021 - 16th Conference of the European Chapter of the Association
for Computational Linguistics, Association for Computational Linguistics, 2021, pp. 874–880.</p>
      <p>[23] U. Khandelwal, O. Levy, D. Jurafsky, L. Zettlemoyer, M. Lewis, Generalization through
memorization: Nearest neighbor language models, arXiv preprint arXiv:1911.00172 (2019).</p>
      <p>[24] Z. He, Z. Zhong, T. Cai, J. Lee, D. He, Rest: Retrieval-based speculative decoding, in: Proceedings
of the 2024 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 1582–1595.</p>
      <p>[25] X. Ma, Y. Gong, P. He, H. Zhao, N. Duan, Query rewriting in retrieval-augmented large language
models, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language
Processing, 2023, pp. 5303–5315.</p>
      <p>[26] V. Raina, M. Gales, Question-based retrieval using atomic units for enterprise rag, arXiv preprint
arXiv:2405.12363 (2024).</p>
      <p>[27] P. Sarthi, S. Abdullah, A. Tuli, S. Khanna, A. Goldie, C. D. Manning, Raptor: Recursive abstractive
processing for tree-organized retrieval, in: The Twelfth International Conference on Learning
Representations, 2024.</p>
      <p>[28] S. Zhuang, B. Liu, B. Koopman, G. Zuccon, Open-source large language models are strong zero-shot
query likelihood models for document ranking, in: The 2023 Conference on Empirical Methods in
Natural Language Processing, 2023.</p>
      <p>[29] S. Hofstätter, J. Chen, K. Raman, H. Zamani, Fid-light: Efficient and effective retrieval-augmented
text generation, in: Proceedings of the 46th International ACM SIGIR Conference on Research
and Development in Information Retrieval, 2023, pp. 1437–1447.</p>
      <p>[30] B. Peng, Y. Zhu, Y. Liu, X. Bo, H. Shi, C. Hong, Y. Zhang, S. Tang, Graph retrieval-augmented
generation: A survey, arXiv preprint arXiv:2408.08921 (2024).</p>
      <p>[31] X. Bi, D. Chen, G. Chen, et al., Deepseek LLM: Scaling open-source language models with
longtermism, 2024. URL: https://arxiv.org/abs/2401.02954. arXiv:2401.02954.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Dam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <surname>C. Zhang,</surname>
          </string-name>
          <article-title>A complete survey on llm-based ai chatbots</article-title>
          ,
          <source>arXiv preprint arXiv:2406.16937</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nejjar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zacharias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Stiehle</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Weber</surname>
          </string-name>
          ,
          <article-title>Llms for science: Usage for code generation and data analysis</article-title>
          ,
          <source>Journal of Software: Evolution and Process</source>
          (
          <year>2023</year>
          )
          <article-title>e2723</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Avadhanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Benchmarking llm powered chatbots: methods and metrics</article-title>
          ,
          <source>arXiv preprint arXiv:2308.04624</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chiorrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Diamantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mircoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Potena</surname>
          </string-name>
          , E. Storti, Emotionalberto:
          <article-title>Emotion recognition of italian social media texts through bert</article-title>
          ,
          <source>in: 2022 26th International Conference on Pattern Recognition (ICPR)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1706</fpage>
          -
          <lpage>1711</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Perković</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Drobnjak</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Botički</surname>
          </string-name>
          ,
          <article-title>Hallucinations in llms: Understanding and addressing challenges, in: 2024 47th MIPRO ICT and Electronics Convention (MIPRO)</article-title>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>2084</fpage>
          -
          <lpage>2088</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lomeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dwivedi-Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          , E. Grave, Atlas:
          <article-title>Few-shot learning with retrieval augmented language models</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Evaluation of retrieval-augmented generation: A survey</article-title>
          , in: W. Zhu, H. Xiong, X. Cheng, L. Cui, Z. Dou, J. Dong, S. Pang, L. Wang, L. Kong, Z. Chen (Eds.),
          <source>Big Data</source>
          , Springer Nature Singapore, Singapore,
          <year>2025</year>
          , pp.
          <fpage>102</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          , W. Ma,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          , et al.,
          <article-title>A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , W. Chen, LoRA:
          <article-title>Low-rank adaptation of large language models</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          . URL: https://openreview.net/forum?id=nZeVKeeFYf9.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>O.</given-names>
            <surname>Honovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Scialom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          , T. Schick,
          <article-title>Unnatural instructions: Tuning language models with (almost) no human labor, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics</article-title>
          (Volume 1: Long Papers),
          <year>2023</year>
          , pp.
          <fpage>14409</fpage>
          -
          <lpage>14428</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Agarwal,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          , et al.,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>27730</fpage>
          -
          <lpage>27744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rafailov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          , E. Mitchell, C. D. Manning, S. Ermon, C. Finn,
          <article-title>Direct preference optimization: Your language model is secretly a reward model</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-N.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <article-title>Visual prompt tuning</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>709</fpage>
          -
          <lpage>727</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          , I. Shafran,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          , React:
          <article-title>Synergizing reasoning and acting in language models</article-title>
          ,
          <source>in: International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>