<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Eighth Workshop on Natural Language for Artificial Intelligence, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>UniQA: an Italian and English Question-Answering Data Set Based on Educational Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Irene Siragusa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Pirrone</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, IT University of Copenhagen</institution>
          ,
          <addr-line>København S, 2300</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Engineering, University of Palermo</institution>
          ,
          <addr-line>Palermo, 90128, Sicily</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2</volume>
      <fpage>6</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>In this paper we introduce UniQA, a high-quality Question-Answering data set that comprises more than 1k documents and nearly 14k QA pairs. UniQA was generated in a semi-automated manner from data retrieved from the website of the University of Palermo, and covers information about the bachelor and master degree courses for the academic year 2024/2025. The data are in both Italian and English, which makes the data set suitable for QA and translation models. To assess the data, we propose a Retrieval-Augmented Generation model based on Llama-3.1-instruct. UniQA can be found at https://github.com/CHILab1/UniQA.</p>
      </abstract>
      <kwd-group>
        <kwd>Question Answering</kwd>
        <kwd>RAG</kwd>
        <kwd>Large Language Model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        QA is a classical task in Natural Language Processing where a model is asked to answer a question
relying on a given context. Unfortunately, annotated QA data sets, and Italian ones in particular, are
not common. A valuable example is SQuAD-it [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], derived from the English QA data set SQuAD [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
which collects more than 60k QA pairs obtained via a semi-automatic translation procedure. Generally
speaking, data sets obtained via translation can be useful when large quantities of native Italian data
are not available, but they usually do not match the quality of manually annotated ones. On the other
hand, QUANDHO [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is a closed-domain QA data set built from native Italian texts that collects
627 manually classified questions, thus reaching a high level of data quality, although its size is moderate. In
our work we aim to fill this gap by creating a new QA data set with a considerably large number of
manually generated prompts for both questions and answers, which rely on structured data in both
Italian and English without using any translation procedure.
      </p>
      <p>
        QA is commonly tackled with LLMs by means of RAG, to reduce both hallucinations and off-topic answers.
RAG-based applications [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] mostly share the same architecture, where a retrieval component,
typically a vector store, is used to store and retrieve documents related to the input, and an LLM-based
generator infers the answers according to a prompt strategy suited to the target application. Most of
these applications involve English data, and they are suitable for developing chatbots or QA systems.
Interesting works in this field that use Italian focus on building a virtual
assistant to help users in diverse tasks, such as retrieving information about pregnancy [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], obtaining
suggestions about how to write an Italian funding application [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], or obtaining real-time data in an
industrial context [11].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Data set description</title>
      <p>To build UniQA, we started from a web scraping procedure using both Selenium<sup>2</sup> and Beautiful Soup<sup>3</sup>
over the website of the University of Palermo<sup>4</sup>, thus collecting a total of 1048 documents containing
information about the bachelor and master degree courses for the academic year 2024/2025. Table
1 reports the number of documents collected in the Italian and English splits, along with
the total (JOINT). Both the Italian and the English documents are original ones, scraped from
the corresponding pages of the UniPA website in either its Italian or its English version; thus, no
translation from Italian to English was made to create the data set.</p>
      <p>For each available course, two documents were generated, namely Course info and Course outline,
which share the same header, collecting general information about the course such as the type of degree,
the Department of affiliation, and the access rules. Course info also reports the educational
objectives, the professional opportunities, and the final examination rules for the specific course. Although
the University offers a total of 190 bachelor and master degrees, we collected 262 document couples.
Since a course can have multiple curricula, which differ in either some classes or the location
where the course is held, it was necessary to build both documents for each of them, causing small
overlaps and repetitions in the Course info documents. Documents of the same type follow the
same structure, thus allowing semi-automated information extraction for the generation of the QA
data set. In addition, we added to each document one of the following phrases, "For more information visit the
course website [link]" or "Per maggiori informazioni consulta il sito del corso [link]", according to the
specific language split, reporting the link to the web page of the course.</p>
      <p>In particular, ten different QA prompts were generated (five prompts for each language split); they
are reported in Tables 2 and 3, and refer to the common header shared by each Course info-Course
outline document couple. Moreover, six prompts (three prompts for each language split) were generated
specifically for each Course info document, which are reported in Tables 4 and 5, and four prompts (two
prompts for each language split) for each Course outline document (see Tables 6 and 7).</p>
      <sec id="sec-3-1">
        <p><sup>2</sup> https://www.selenium.dev/</p>
        <p><sup>3</sup> https://www.crummy.com/software/BeautifulSoup/bs4/doc/</p>
        <p><sup>4</sup> https://offertaformativa.unipa.it/offweb/public/corso/ricercaSemplice.seam</p>
        <p>The bachelor degree in course name for the academic year 2024/2025 is a
n-year course held in location at the department name. It is possible to
choose among one of the following curriculum: curriculum list. It is a
closed access course with number seats available. It is possible to obtain a
double degree with affiliate university.</p>
        <p>In the last group of prompts, the former asks for generic information about the list of classes held in a
target year, while the latter requests specific information about a target class. As a consequence, the
number of generated QA pairs for each document depends on both the number of years of the bachelor
or master course and the number of classes. The phrases "For more information
visit the course website [link]" and "Per maggiori informazioni vai su [link]" are further concatenated to
each answer of the generated QA pairs in the English and the Italian split, respectively.</p>
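        <p>The template-filling step described above can be illustrated with a minimal sketch. It assumes that each scraped course is available as a dictionary of fields; the field names, the question wording, the sample record and the helper <monospace>build_qa_pair</monospace> are hypothetical and only mirror the answer template shown in this section, not the actual generation code used for UniQA.</p>
        <preformat preformat-type="code">
```python
from string import Template

# Hypothetical templates modelled on the example in the text; the field names
# are illustrative assumptions, not the schema used to build UniQA.
QUESTION = Template("Where is the $degree_type degree in $course_name held?")
ANSWER = Template(
    "The $degree_type degree in $course_name for the academic year "
    "$academic_year is a $n_years-year course held in $location at the "
    "department $department."
)
FOOTER = Template("For more information visit the course website $link")


def build_qa_pair(record):
    """Fill the question and answer templates from one structured course record."""
    answer = ANSWER.substitute(record)
    # The link sentence is concatenated to every generated answer.
    answer = answer + " " + FOOTER.substitute(record)
    return {"question": QUESTION.substitute(record), "answer": answer}


# Made-up record used only to exercise the templates.
record = {
    "degree_type": "bachelor",
    "course_name": "Computer Engineering",
    "academic_year": "2024/2025",
    "n_years": 3,
    "location": "Palermo",
    "department": "Engineering",
    "link": "[link]",
}
pair = build_qa_pair(record)
print(pair["question"])
print(pair["answer"])
```
        </preformat>
        <p>Because every pair is produced from the same record, question and answer stay consistent with the source document, which is what makes the semi-automated procedure feasible on documents sharing a fixed structure.</p>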
        <p>We are aware of the limitations and redundancy of a data set generated from a small number
of manually annotated templates for questions and answers. Despite this, our focus was on
generating a data set suitable for fine-tuning an LLM so that it can generate answers that are not
exact excerpts of the documents they were generated from, but a paraphrased version. At the
end of the generation process, we collected a total of 13742 QA pairs, equally split into 6871 Italian pairs
and 6871 English pairs, as shown in Table 8.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Data split</title>
        <p>To stress the scientific interest of the developed data set, we also provide a set of train-test splits of the
data set, which are combined with the language splits. The resulting available splits are reported in Table
9. Starting from the Course info documents, we first selected all the unique bachelor and master degrees,
so that courses with multiple curricula were counted once; a set of 190 courses was thus obtained. Then
the courses were grouped by the Department they belong to and, for each group, a
random 80-20 split was made to generate the train and test groups. This procedure was implemented to
ensure that:</p>
        <list list-type="bullet">
          <list-item><p>bachelor and master courses with multiple curricula are considered as a single block, and are therefore
put together in either the train or the test split;</p></list-item>
          <list-item><p>courses belonging to the same Department are divided in the same proportion, to prevent any bias in the trained
models.</p></list-item>
        </list>
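        <p>The splitting procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to build the released splits: the function name <monospace>split_by_department</monospace>, the random seed, the rounding rule and the toy course names are all hypothetical.</p>
        <preformat preformat-type="code">
```python
import random
from collections import defaultdict


def split_by_department(courses, test_fraction=0.2, seed=0):
    """Return (train, test) lists of course names, split within each department.

    `courses` maps a unique course name to its department. Curricula of the
    same course share one name, so a course can only land in one split.
    """
    rng = random.Random(seed)
    by_department = defaultdict(list)
    for course, department in sorted(courses.items()):
        by_department[department].append(course)

    train, test = [], []
    for department in sorted(by_department):
        group = by_department[department]
        rng.shuffle(group)
        # Rounding rule is an assumption; the text only states an 80-20 split.
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy input: ten made-up courses spread over two departments.
courses = {f"Course {i}": ("Engineering" if i % 2 == 0 else "Law") for i in range(10)}
train, test = split_by_department(courses)
print(len(train), len(test))
```
        </preformat>
        <p>All documents and QA pairs of a course, including those of its curricula, would then be assigned to the split of that course.</p>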
        <p>Due to computational constraints on the training procedure of the generator LLM in our RAG
architecture, we also created a reduced split of the data set that keeps only the samples whose overall input is shorter than 3000 tokens. It
is 16% smaller than the original one, and is thus expected to cause only a minor reduction in performance.</p>
        <p>
          In all the splits we included the QA pairs as well as the original documents, thus allowing the
data set to be suitable for a large variety of NLP tasks, such as translation and QA with support of
external knowledge (QA-EK). In this paper we report the performance on a QA-EK task of a RAG-based
architecture based on Llama 3.1 both in the Foundational and Instruct version [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental setup</title>
        <p>We implemented a RAG-based architecture to perform QA-EK tasks on the UniQA data set in order to
evaluate the quality of our data with respect to the correctness of the provided answers, which is also
related to the retrieval accuracy of the relevant documents. Such evaluations can be easily performed
since golden answers are known. Finally, this type of architecture can be easily queried with
domain-related questions that are not in the UniQA data set: in this case, answers can be generated from the
retrieved documents, but evaluation is not straightforward owing to the lack of the corresponding golden answer.</p>
        <p>The implemented RAG-based architecture is illustrated in Figure 1, where two main components can
be distinguished: the retriever module and the generator LLM.</p>
        <p><bold>Retriever module.</bold> The retriever module is composed of a vector store and an Embeddings LLM. To
build it, we implemented a FAISS-based vector store [12] into which the generated documents, from both
the train and the test split, were loaded after being split into 1000-token chunks with 100 overlapping
tokens, using tiktoken<sup>5</sup> as the tokenizer. The token chunks are then processed by an LLM tailored
for embedding generation (Embeddings LLM), with retrieval capabilities, that supports both English
and Italian. Accordingly, we selected from the Massive Text Embedding Benchmark (MTEB) [13]<sup>6</sup> the best models that meet our constraints on computational
resources, namely BGE-M3 (BGE) [14],
gte-Qwen2-7B-instruct (GTE) [15] and Multilingual-E5-large-instruct (m-E5) [16]. All
of them were trained on multilingual data, including English and Italian; in practice, it is not easy to
find models that explicitly state that they were also trained on Italian data. As for the internal architecture, all
of them are built upon the Transformer encoder [17]; both BGE and m-E5 are small 5M models, while
GTE is a 7B one. One vector database for each model was built using the LangChain framework<sup>7</sup>, and
their retrieval performance is reported in Section 5.</p>
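        <p>The chunking scheme described above (fixed-size windows with a fixed overlap) can be sketched in a few lines. This is a dependency-free illustration: integers stand in for the token ids that a tokenizer such as tiktoken would produce, and <monospace>chunk_tokens</monospace> is a hypothetical helper, so the exact chunk boundaries produced by the LangChain splitter used in the experiments may differ.</p>
        <preformat preformat-type="code">
```python
def chunk_tokens(tokens, chunk_size=1000, overlap=100):
    """Split a token sequence into windows of `chunk_size` sharing `overlap` tokens."""
    if overlap >= chunk_size or 0 > overlap:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not tokens:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the sequence.
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks


# Stand-in for a tokenized document of 2500 tokens.
tokens = list(range(2500))
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])
```
        </preformat>
        <p>The overlap means that a sentence falling on a chunk boundary is still fully contained in at least one chunk, at the cost of embedding some tokens twice.</p>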
        <p>
          <bold>Generator LLM.</bold> We decided to stress the capabilities of Llama-3.1 8B models [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], the latest
decoder-only generative model of the Llama family, which natively supports both Italian and English
and is freely available. We tested both the Foundational and the Instruct models, providing two
different English prompts: Prompt 1 is designed as a standard instruction prompt, and it is suitable for
Foundational models, while Prompt 2 follows the instruction prompt suggested for both instruction
tuning and inference by the authors of the Llama 3.1 models [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Prompts are reported below.
        </p>
        <sec id="sec-4-2-1">
          <p><sup>5</sup> https://github.com/openai/tiktoken</p>
          <p><sup>6</sup> https://huggingface.co/spaces/mteb/leaderboard</p>
          <p><sup>7</sup> https://www.langchain.com/langchain</p>
          <p>Prompt 1</p>
          <preformat>Below is an instruction that describes a task, paired with an input
that provides further context. Write a response that appropriately
completes the request.
### Instruction: You are Unipa-GPT, the chatbot and virtual assistant of the University
of Palermo. Provide a cordially and colloquially answers to the questions provided. If you
receive a greeting, answer by greeting and introducing yourself. If you receive a question
concerning the University of Palermo, answer relying on the documents given to you with
the question. If you do not know how to answer, apologize and suggest that you consult the
university website [https://www.unipa.it/], do not invent answers. If the question is in English,
answer in English. If the question is in Italian, answer in Italian.
### Input:
Question: question
Documents: context
### Response:</preformat>
          <p>Prompt 2</p>
          <preformat>You are Unipa-GPT, the chatbot and virtual assistant of the University of Palermo. Provide a
cordially and colloquially answers to the questions provided. If you receive a greeting, answer
by greeting and introducing yourself. If you receive a question concerning the University
of Palermo, answer relying on the documents given to you with the question. If you do
not know how to answer, apologize and suggest that you consult the university website
[https://www.unipa.it/], do not invent answers. If the question is in English, answer in English.
If the question is in Italian, answer in Italian.
Question: question
Documents: context</preformat>
          <p>Both prompts were used to query the Foundational models, while only Prompt 2 was used for the
Instruct models. Despite the multilingual task, we opted for an English prompt since it can be more
flexible in a real-world application, where the language of both the questions and the documents provided
as inputs is not known a priori. After evaluating the models in their base versions, we proceeded with a
3-epoch fine-tuning procedure on the best-performing one, that is Llama-3.1-Instruct, using its associated prompt (Prompt
2) and providing the golden documents as context. Fine-tuning was performed with LoRA [18] following the Alpaca-LoRA hyper-parameters<sup>8</sup>, and we
will refer to the resulting model as UniQA-3-ft. We developed the entire system on a server
with two Intel(R) Xeon(R) Gold 6442Y CPUs, 384 GB of RAM, and two 48 GB NVIDIA RTX 6000 Ada
Generation GPUs.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>To evaluate the retrieval performance, we first performed a cluster analysis in the embedding space
relying on the “native” clustering of the documents, which can be divided by language (Italian or English)
or by Department. The scatter plots of the embedding spaces for each Embeddings LLM are reported in
Figures 2 and 3. Such plots were obtained after a dimensionality reduction performed using
t-SNE [19]. Along with the graphical visualization, we provide a quantitative analysis by calculating the
Silhouette Coefficient [20], as shown in Table 10.</p>
      <p>Both the graphical and the quantitative analysis highlight that m-E5 has the best clustering
capabilities in language separation (Figure 2.c), while the other models tend to overlap the embeddings of the two languages.
Conversely, the other models perform better in grouping documents semantically, that is, by Department.
In particular, GTE outperforms the other models (Figure 3.b). Indeed, documents referring to the Computer
and Mechanical Engineering degree courses, which are taught in the same Department, have much more
in common than the ones concerning Nursing or Law. Moreover, the Italian description of a degree
course contains many English terms, and this can make it harder to cluster documents based on their
native language.</p>
      <p>The retrieval performance of the models was evaluated by querying their vector stores with question
samples belonging to the test set<sup>9</sup>: if at least one of the retrieved documents matched the golden one
associated with the question, the retrieval was considered correct. An accuracy measure was thus
computed, as reported in Table 10: here the superiority of GTE is confirmed with an accuracy of
almost 86%, while BGE reaches an accuracy just over 81%, and m-E5 attains just 77%.</p>
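      <p>The accuracy criterion stated above (a retrieval counts as correct when the golden document appears among the retrieved ones) can be written as a short function. This is a minimal sketch: the function name <monospace>retrieval_accuracy</monospace> and the document identifiers are hypothetical, and the number of documents retrieved per question is left to the caller.</p>
      <preformat preformat-type="code">
```python
def retrieval_accuracy(retrieved, golden):
    """Fraction of questions whose golden document id is among the retrieved ids.

    `retrieved` holds one list of document ids per question;
    `golden` holds the single expected document id for each question.
    """
    if len(retrieved) != len(golden):
        raise ValueError("retrieved and golden must have the same length")
    if not golden:
        return 0.0
    hits = sum(1 for docs, gold in zip(retrieved, golden) if gold in docs)
    return hits / len(golden)


# Made-up ids: the third question misses its golden document.
retrieved = [["d1", "d7"], ["d3"], ["d2", "d9"], ["d5", "d4"]]
golden = ["d7", "d3", "d8", "d4"]
accuracy = retrieval_accuracy(retrieved, golden)
print(accuracy)
```
      </preformat>
      <p>Since a hit only requires one match among the retrieved documents, this measure rewards recall of the golden document and says nothing about how many irrelevant chunks accompany it.</p>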
      <p>QA evaluation was performed by querying the models using the reduced joint version of the test set
(Table 11), the English-only reduced split (Table 12), and the Italian-only one (Table 13). Performance was
measured in terms of BLEU [21] and ROUGE [22] scores. The inference ability of Llama 3.1 Foundational, Llama
3.1 Instruct and UniQA-3-ft was assessed without any quantization strategy. Llama 3.1 Foundational
was queried using both Prompt 1 and Prompt 2, while Llama 3.1 Instruct and UniQA-3-ft were queried
using only Prompt 2, since they are instruction fine-tuned models. The evaluation consisted of two
runs. In the first run, the golden context was provided in a one-shot scenario to the LLM without RAG,
while the second one made use of the GTE-based retriever module. The former run was aimed at evaluating
the inherent capability of the model to generate a correct answer that adheres to the golden one, which is a
paraphrase of the context. The latter run was aimed at evaluating the end-to-end performance of the
whole RAG architecture. We will refer to models that make use of RAG using the suffix retrieved.</p>
      <p><sup>9</sup> We used the reduced split in this experiment, just as in the ones devoted to QA evaluation.</p>
      <p>Prompt 3</p>
      <preformat>You are Unipa-GPT, the chatbot and virtual assistant of the University of Palermo. Provide a
cordially and colloquially answers to the questions provided. If you receive a greeting, answer
by greeting and introducing yourself. If you do not know how to answer, apologize and suggest
that you consult the university website [https://www.unipa.it/], do not invent answers. If the
question is in English, answer in English. If the question is in Italian, answer in Italian.
Question: question</preformat>
      <p>To provide a more comprehensive overview of our contribution, we tested our UniQA-3-ft fine-tuned
model to assess its generation capabilities in a zero-shot scenario without RAG: we will refer to this
evaluation configuration as UniQA-3-ft no-RAG. A suitable version of Prompt 2 was devised for this
purpose, which we called Prompt 3; it does not contain any instruction to rely on external documents for
answer generation. In Tables 11, 12 and 13, the best results for each run are in bold, while italicised score
values are used in the third run when UniQA-3-ft no-RAG performed better than UniQA-3-ft
retrieved.</p>
      <p>As expected, UniQA-3-ft and UniQA-3-ft retrieved outperform the other models in their
respective runs, while the difference in their performance is small, and it mainly depends on
the quality of the retrieved documents. Llama 3.1 Foundational performs slightly better using Prompt 2
than using Prompt 1, and Llama 3.1 Instruct clearly shows its ability to follow instructions in both
settings.</p>
      <p>UniQA-3-ft no-RAG reaches performance comparable to UniQA-3-ft retrieved, and in some cases it scores
higher than the RAG version. This finding clearly indicates that UniQA is a high-quality, robust data set
that can be used to test both fine-tuned models and RAG architectures. It is worth noticing that the
structure of the train-test split guarantees that the answers provided by UniQA-3-ft no-RAG leverage
only the knowledge acquired during the fine-tuning phase. In fact, when answering a question
belonging to the test set, the model is completely unaware of the degree courses that are not in its
training set (Section 4.1). In this configuration, the inference capabilities of the model can be truly
tested, since it relies on the knowledge acquired from the QA pairs of similar degree courses in the
same Department.</p>
      <p>Through a manual inspection of some of the generated answers, we found that both the non-fine-tuned
models and the fine-tuned ones tend to output misspelled words, while both UniQA-3-ft no-RAG and the
non-fine-tuned models provide incorrect answers since they have no access to complete knowledge of
the UniPA domain, and thus reply leveraging their native and incomplete knowledge. The non-fine-tuned
models tend to output verbose answers that omit important information, wandering
off into a hypothetical degree course outline that is not required and may be imprecise. Generally
speaking, the non-fine-tuned models may output some correct information, but in a different format from
the one provided in the golden answer, thus making it more difficult to evaluate the overall correctness
of the generated replies.</p>
      <p>In both the golden answers and the retrieved documents a suggestion is reported for the final user to
visit the website of the degree course to get more information: models try to generate links following
the structure of the ones provided in the retrieved documents and in the prompt. The non fine-tuned
models fail since either no link is generated at all or the generated link does not refer to the UniPA
website. Fine-tuned models perform better, but not all the generated links are correct since misspellings
are quite common.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and future works</title>
      <p>In this paper we presented UniQA, a high-quality QA data set in Italian and English suitable for
translation and question-answering tasks where external knowledge is required. UniQA is balanced
between the two languages, and it did not require any translation because it was scraped from the
original Italian and English web pages related to the degree courses offered at UniPA. UniQA counts
1048 documents and 13742 QA pairs generated in a semi-automated manner.</p>
      <p>We also tested a RAG-based architecture for QA with external knowledge tasks whose generator
LLMs were Llama 3.1 Foundational and Llama 3.1 Instruct. Llama 3.1 was selected as a proof
of concept because it is recognized as a SOTA multilingual LLM, and because both the fine-tuning and the
inference-only runs required a considerable amount of time on our local computational facilities. At
the time of submitting the manuscript, extensive tests are being run using both Foundation and
Instruct LLMs based on architectures different from Llama, as well as the best-known Italian
adaptations of such models.</p>
      <p>Future developments of this work are directed towards both extensive fine-tuning of the models under
investigation and end-to-end training of the whole RAG architecture, including the retriever. Finally,
a hybrid RAG architecture using both vector and graph databases is under development to encode both
(vector) semantic similarity between documents and their closeness with respect to a domain ontology
implemented as a graph of semantic relations between the documents in the corpus.</p>
      <p>[11] R. Figliè, T. Turchi, G. Baldi, D. Mazzei, Towards an LLM-based intelligent assistant for Industry 5.0,
in: Proceedings of the 1st International Workshop on Designing and Building Hybrid Human–AI
Systems (SYNERGY 2024), volume 3701, 2024. URL: https://ceur-ws.org/Vol-3701/paper7.pdf.</p>
      <p>[12] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on
Big Data 7 (2019) 535–547.</p>
      <p>[13] N. Muennighoff, N. Tazi, L. Magne, N. Reimers, MTEB: Massive text embedding benchmark, arXiv
preprint arXiv:2210.07316 (2022). URL: https://arxiv.org/abs/2210.07316. doi:10.48550/ARXIV.2210.07316.</p>
      <p>[14] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, Z. Liu, BGE M3-Embedding: Multi-lingual,
multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024.
arXiv:2402.03216.</p>
      <p>[15] Z. Li, X. Zhang, Y. Zhang, D. Long, P. Xie, M. Zhang, Towards general text embeddings with
multi-stage contrastive learning, arXiv preprint arXiv:2308.03281 (2023).</p>
      <p>[16] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, F. Wei, Multilingual E5 text embeddings: A
technical report, arXiv preprint arXiv:2402.05672 (2024).</p>
      <p>[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[18] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank
adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021).</p>
      <p>[19] L. Van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research
9 (2008).</p>
      <p>[20] P. J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,
Journal of Computational and Applied Mathematics 20 (1987) 53–65.</p>
      <p>[21] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine
translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational
Linguistics, 2002, pp. 311–318.</p>
      <p>[22] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization
Branches Out, 2004, pp. 74–81.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bonetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Hromei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Stranisci</surname>
          </string-name>
          ,
          <article-title>Preface to the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI)</article-title>
          , in:
          <source>Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with 23th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] OpenAI,
          <source>GPT-4 technical report</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Anil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Alayrac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Soricut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schalkwyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hauth</surname>
          </string-name>
          , et al.,
          <article-title>Gemini: a family of highly capable multimodal models</article-title>
          ,
          <source>arXiv preprint arXiv:2312.11805</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive NLP tasks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <collab>Llama Team, AI @ Meta</collab>
          ,
          <source>The Llama 3 herd of models</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zelenanska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Basili</surname>
          </string-name>
          ,
          <article-title>Neural learning for question answering in Italian</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Ghidini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Traverso</surname>
          </string-name>
          (Eds.),
          <source>AI*IA 2018 - Advances in Artificial Intelligence</source>
          , Springer International Publishing, Cham,
          <year>2018</year>
          , pp.
          <fpage>389</fpage>
          -
          <lpage>402</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lopyrev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>SQuAD: 100,000+ questions for machine comprehension of text</article-title>
          ,
          <year>2016</year>
          . URL: https://arxiv.org/abs/1606.05250. arXiv:1606.05250.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Uva</surname>
          </string-name>
          ,
          <article-title>“Who was Pietro Badoglio?” Towards a QA system for Italian history</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Calzolari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choukri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Declerck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grobelnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maegaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mariani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mazo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Odijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Piperidis</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)</source>
          ,
          <publisher-name>European Language Resources Association (ELRA)</publisher-name>
          , Portorož, Slovenia,
          <year>2016</year>
          , pp.
          <fpage>430</fpage>
          -
          <lpage>435</lpage>
          . URL: https://aclanthology.org/L16-1069.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghanbari Haez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Segala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bellan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Magnolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Consolandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dragoni</surname>
          </string-name>
          ,
          <article-title>A retrieval-augmented generation strategy to enhance medical chatbot reliability</article-title>
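          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Finkelstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Moskovitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Parimbelli</surname>
          </string-name>
          (Eds.),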
          <source>Artificial Intelligence in Medicine</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>223</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Boccato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferrante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Toschi</surname>
          </string-name>
          ,
          <article-title>Two-phase RAG-based chatbot for Italian funding application assistance</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>