                         UniQA: an Italian and English Question-Answering Data
                         Set Based on Educational Documents
                         Irene Siragusa1,2,* , Roberto Pirrone1
                         1
                             Department of Engineering, University of Palermo, Palermo, 90128, Sicily, Italy
                         2
                             Department of Computer Science, IT University of Copenhagen, København S, 2300, Denmark


Abstract
In this paper we introduce UniQA, a high-quality Question-Answering data set that comprises more than 1k documents and nearly 14k QA pairs. UniQA has been generated in a semi-automated manner using the data retrieved from the website of the University of Palermo, covering information about the bachelor and master degree courses for the academic year 2024/2025. Data are provided both in Italian and English, thus making the data set suitable for QA and translation models. To assess the data, we propose a Retrieval Augmented Generation model based on Llama-3.1-instruct. UniQA can be found at https://github.com/CHILab1/UniQA.

Keywords
Question Answering, RAG, Large Language Model




                         1. Introduction
The ever-increasing interest towards both the implementation and the usage of Large Language
Models (LLMs) involves not only the scientific community, but also users who are already acquainted
with models such as ChatGPT [2] and Gemini [3], which allow for chat-based interaction. Despite a
general trust in those systems, it is clear that they are not so precise in answering domain-specific
questions, at least without external methodologies such as fine-tuning or the integration of external
knowledge via a Retrieval Augmented Generation (RAG) approach [4]. Moreover, the development and
the evaluation of chat-based applications aimed at providing users with precise answers in a specific
domain is a quite difficult task, due to the lack of domain-specific annotated and high-quality
data sets, such as Question-Answer (QA) pairs.
   To overcome this issue, we built UniQA1, a balanced Italian and English QA data set suitable for
a domain-specific QA task where external knowledge is required. Our data set also comprises a
corpus of 1048 documents extracted through a scraping procedure over the website of the University of
Palermo (UniPA), from which nearly 14k QA pairs have been generated in a semi-automatic manner.
In addition, we evaluated UniQA by building a RAG-based QA architecture with the Llama-3.1
model [5] as text generator.
   The paper is arranged as follows: related works are reported in Section 2, details on the building
process of UniQA are given in Section 3, the experimental setup is described in Section 4, and the
results obtained using Llama-3.1 are discussed in Section 5. Concluding remarks and future works are drawn in Section 6.


                         2. Related works
QA is a classical task in Natural Language Processing where a model is asked to answer a question
relying on a given context. Unfortunately, annotated QA data sets, and specifically Italian ones, are
not so common. A valuable example is SQuAD-it [6], derived from the English QA data set SQuAD [7],
which collects more than 60k QA pairs obtained via a semi-automatic translation procedure.
NL4AI 2024: Eighth Workshop on Natural Language for Artificial Intelligence, November 26-27th, 2024, Bolzano, Italy [1]
* Corresponding author.
irene.siragusa02@unipa.it (I. Siragusa); roberto.pirrone@unipa.it (R. Pirrone)
ORCID: 0009-0005-8434-8729 (I. Siragusa); 0000-0001-9453-510X (R. Pirrone)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 https://github.com/CHILab1/UniQA

Generally speaking, data sets obtained via translation can be useful when large quantities of native Italian
data are not available, but in general they do not have the quality of manually annotated ones. On the other
hand, QUANDHO [8] is a closed-domain QA data set built from native Italian texts that collects
627 manually classified questions, thus reaching a high level of data quality, but its size is moderate. In
our work we want to fill this gap by creating a new QA data set with a considerably large number of
manually generated prompts for both questions and answers, which rely on structured data in both
Italian and English without using any translation procedure.
   QA is faced with LLMs by means of RAG to reduce both hallucinations and off-topic answers. RAG-
based applications [4] mainly share the same architectural structure, where a retrieval component,
typically a vector store, is used to save and retrieve documents related to the input, and an LLM-based
generator infers the answers according to a prompt strategy suited for the target application. Most of
the applications involve English data, and they are suitable for developing chat-bots or QA systems.
Interesting works in this field that use Italian involve applications whose main focus is building a virtual
assistant to help users in diverse tasks, such as retrieving information about pregnancy [9], gaining
suggestions about how to write an Italian funding application [10], or obtaining real-time data in an
industrial context [11].


3. Data set description
To build UniQA, we started from a web scraping procedure using both Selenium2 and Beautiful Soup3
over the website of the University of Palermo4, thus collecting a total of 1048 documents containing
information about the bachelor and master degree courses for the academic year 2024/2025. Table 1
reports the number of documents collected in the Italian and English splits along with
the total ones (JOINT). Both the Italian and English documents are original ones, scraped from
the corresponding pages of the UniPA website in either the Italian or the English version, thus no
translation has been made from Italian to English to create the data set. A minimal sketch of the
scraping step is reported below.
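The scraping pipeline can be sketched as follows; note that the CSS selector and the page traversal are hypothetical placeholders, since the actual UniPA markup is not reported here.

    # Minimal sketch of the scraping step, assuming a hypothetical page
    # structure: the "a.course-link" selector is illustrative, not the
    # actual UniPA markup.
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("https://offertaformativa.unipa.it/offweb/public/corso/ricercaSemplice.seam")
    soup = BeautifulSoup(driver.page_source, "html.parser")

    # Collect the links to the individual course pages.
    course_links = [a["href"] for a in soup.select("a.course-link")]

    documents = []
    for link in course_links:
        driver.get(link)
        page = BeautifulSoup(driver.page_source, "html.parser")
        documents.append(page.get_text(separator="\n", strip=True))

    driver.quit()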

Table 1
Overview of the document splits built from scraping the UniPA website.
                                                # IT-split     # EN-split      # JOINT-split
                              Course info           262            262             524
                             Course outline         262            262             524
                                 # total            524            524             1048

  For each available course, two documents have been generated, namely Course info and Course outline,
which share an equal header collecting general information about the course, such as the type of degree,
the Department of affiliation, and the access rules. Course info also reports the educational
objectives, the professional opportunities, and the final examination rules for the specific course. Although
the University offers a total of 190 bachelor and master degrees, we collected 262 document couples.
Since a course can have multiple curricula, which differ in either some classes or the location
where the course is held, it was necessary to build both documents for each of them, causing small
overlaps and repetitions among the Course info documents. Documents of the same type follow the
same structure, thus allowing a semi-automated information extraction for the generation of the QA
data set. In addition, we added to each document the following phrases: "For more information visit the
course website [link]" and "Per maggiori informazioni consulta il sito del corso [link]" ("For more
information visit the course website [link]"), according to the specific language split and reporting
the link to the web page of the course.
  In particular, ten different QA prompts (five for each language split) were generated, which are
reported in Tables 2 and 3 and refer to the common header shared by each Course info-Course outline
document couple. Moreover, six prompts (three for each language split) were generated specifically
for each Course info document, which are reported in Tables 4 and 5, and four prompts (two for each
language split) were generated for each Course outline document (see Tables 6 and 7).

2 https://www.selenium.dev/
3 https://www.crummy.com/software/BeautifulSoup/bs4/doc/
4 https://offertaformativa.unipa.it/offweb/public/corso/ricercaSemplice.seam

Table 2
List of the generated QA pairs, which leverage the common textual header in each document couple in the
English split.

Course info / Course outline - English split

Q: What are the available curriculum for the bachelor degree in course name?
A: There are no available curriculum for the bachelor degree in course name.

Q: Is the master degree in course name a open or closed access course?
A: The master degree in course name is a free access course.

Q: Where do the lessons of the bachelor degree in course name take place?
A: Lessons are held in location at the department name.

Q: Is it possible to obtain a double degree with a master degree in course name?
A: No, it is not possible to obtain a double degree, but it is possible to participate at the Erasmus program.

Q: Provide me some information regarding the bachelor/master degree in course name.
A: The bachelor degree in course name for the academic year 2024/2025 is a n-year course held in location at the department name. It is possible to choose among one of the following curriculum: curriculum list. It is a closed access course with number seats available. It is possible to obtain a double degree with affiliate university.




Table 3
List of the generated QA pairs, which leverage the common textual header in each document couple in the
Italian split.

Course info / Course outline - Italian split

Q: Quali sono i curriculum disponibili per il corso di laurea triennale in nome corso?
A: Non sono disponibili curriculum per il corso triennale in nome corso.

Q: Il corso di laurea magistrale in nome corso è a numero chiuso o ad accesso libero?
A: Il corso di laurea magistrale in nome corso è ad accesso libero.

Q: Dove si svolgono le lezioni del corso di laurea triennale in nome corso?
A: Le lezioni si svolgono presso la sede di luogo del nome dipartimento.

Q: È possibile conseguire il doppio titolo con il corso di laurea magistrale in nome corso?
A: No, non è possibile conseguire il doppio titolo, ma è possibile partecipare al programma Erasmus.

Q: Dammi delle informazioni sul corso di laurea triennale/magistrale in nome corso.
A: Il corso di laurea triennale in nome corso per l'anno accademico 2024/2025 è un corso della durata di n anni presso la sede di luogo del nome dipartimento. È possibile scegliere uno dei seguenti curriculum: lista dei curriculum. Il corso è a numero chiuso, sono disponibili n posti.




   In the last group of prompts, the former asks for generic information about the list of classes held in a
target year, while the latter requests specific information about a target class. As a consequence, the
number of generated QA pairs for each document depends on both the number of years of the bachelor
or master course and on the number of classes themselves. The following phrases, "For more information
visit the course website [link]" and "Per maggiori informazioni vai su [link]" ("For more information go
to [link]"), are further concatenated to each answer of the generated QA pairs in the English and the
Italian split, respectively. A minimal sketch of the template-based generation is reported below.
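As an illustration, the sketch below instantiates one of the English header templates from parsed document fields; the dictionary keys and the prior parsing of the header are assumptions, not the exact UniQA implementation.

    # Minimal sketch of the template-based QA generation, assuming the header
    # fields have already been parsed into a dict; keys are illustrative.
    QUESTION = "Where do the lessons of the bachelor degree in {course} take place?"
    ANSWER = ("Lessons are held in {location} at the {department}. "
              "For more information visit the course website {link}")

    def generate_qa(course_info: dict) -> dict:
        """Instantiate one QA pair from a parsed Course info header."""
        return {
            "question": QUESTION.format(course=course_info["course"]),
            "answer": ANSWER.format(
                location=course_info["location"],
                department=course_info["department"],
                link=course_info["link"],
            ),
        }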
   We are aware of the limitations and redundancy of a data set generated from a small amount
of manually annotated templates for questions and answers. Despite this, our focus was on
generating a data set suitable for fine-tuning an LLM so that it generates answers that are not
exact frames of the documents they have been generated from, but a paraphrased version of them.
Table 4
List of the generated QA pairs for the Course info document, English split.

Course info - English split

Q: What are the educational objectives of the master degree in course name?
A: Educational objectives from the Course info document.

Q: What are the professional opportunities of the bachelor degree in course name?
A: Professional opportunities from the Course info document.

Q: What are the features of the final examination of the master degree in course name?
A: Features of the final examination from the Course info document.




Table 5
List of the generated QA pairs for the Course info document, Italian split.

Course info - Italian split

Q: Quali sono gli obiettivi formativi del corso di laurea magistrale in nome corso?
A: Obiettivi formativi dal documento course info.

Q: Quali sono gli sbocchi occupazionali che il corso di laurea triennale in nome corso?
A: Sbocchi occupazionali dal documento course info.

Q: Quali sono le caratteristiche della prova finale del corso di laurea magistrale in nome corso?
A: Caratteristiche della prova finale dal documento course info.




Table 6
List of the generated QA pairs for the Course outline document, English split.

Course outline - English split

Q: What are the subjects of the target year of the bachelor degree in course name curriculum name?
A: Subjects of the target year of the bachelor degree in course name curriculum name are: subjects list. A thesis is also expected to be conducted. It is possible to choose among the following teachings as for optional subjects: optional subject list.

Q: Provide me some details regarding the teaching of subject from the master degree in course name.
A: subject is a n-ECTS subject of the teaching year of the master degree in course name. Teaching is held by professor surname. Lessons will take place during the target semester.




At the end of the generation process, we collected a total of 13742 QA pairs, equally split into 6871 Italian
pairs and 6871 English pairs, as reported in Table 8.


4. Experiments
4.1. Data split
To stress the scientific interest of the developed data set, we also provide train-test splits of the
data set, one for each language split. The resulting splits are reported in Table 9. Starting from
the Course info documents, we first selected all the unique bachelor and master degrees, so that
courses with multiple curricula were counted once, thus obtaining a set of 190 courses. Then
Table 7
List of the generated QA pairs for the Course outline document, Italian split.

Course outline - Italian split

Q: Quali sono le materie del anno target del corso di laurea triennale in nome corso nome curriculum?
A: Le materie del anno target del corso di laurea triennale in nome corso nome curriculum sono: lista delle materie. È inoltre previsto lo svolgimento della tesi. All'interno del Gruppo di attività formative opzionali è possibile scegliere tra le seguenti materie: lista materie opzionali.

Q: Dammi informazioni sulla materia nome materia del corso di laurea magistrale in nome corso.
A: nome materia è una materia di n CFU del anno insegnamento del corso di laurea magistrale in nome corso. L'insegnamento è tenuto dal professore cognome. Le lezioni si terranno nel target semestre.




Table 8
Distribution of the generated QA pairs in the two language splits.
                                                   # IT-split     # EN-split       # JOINT-split
                          Course info QA              1520             1520              3040
                         Course outline QA            5351             5351              10702
                               # total                6871             6871              13742


the courses were sub-grouped with respect to the Department they belong to, and for each sub-group a
random 80-20 split was done to generate the train and test groups (a sketch of this procedure is given
after the list). This procedure was implemented to ensure that:

    • bachelor and master courses with multiple curricula are considered as a unique block, and then
      put together in either the train or the test split;
    • courses belonging to the same Department are evenly divided between train and test, to prevent
      any bias in the trained models.
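A minimal sketch of the department-wise 80-20 split, assuming each unique course is given as a (name, department) pair; the fixed seed is illustrative.

    # Minimal sketch of the grouped 80-20 split: courses are first grouped by
    # Department, then each group is shuffled and divided.
    import random
    from collections import defaultdict

    def split_courses(courses, test_ratio=0.2, seed=42):
        by_department = defaultdict(list)
        for name, department in courses:
            by_department[department].append(name)

        rng = random.Random(seed)
        train, test = [], []
        for names in by_department.values():
            rng.shuffle(names)
            cut = int(len(names) * (1 - test_ratio))
            train.extend(names[:cut])
            test.extend(names[cut:])
        return train, test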

Due to computational constraints on the training procedure of the generation LLM in our RAG archi-
tecture, we also created a reduced split of the data set, keeping only the samples whose global input is
shorter than 3000 tokens. This split is 16% smaller than the original one, a reduction that should not
significantly affect performance.

Table 9
List of the UniQA train-test splits, grouped into Italian, English and Joint splits.
                                                              #Train      #Test       #All
                                     Documents-IT               398        126       524
                                     Documents-EN               398        126       524
                                    Documents-JOINT             796        252       1048
                                          QAs-IT                5738       1709       7447
                                          QAs-EN                5738       1709       7447
                                         QAs-JOINT             11476       3418      14894
                                     QAs-IT-reduced             4293       1249       5542
                                     QAs-EN-reduced             5303       1556       6859
                                    QAs-JOINT-reduced           9596       2805      12401

  In all the splits we included the QA pairs as well as the original documents, thus making the
data set suitable for a large variety of NLP tasks, such as translation and QA with support of
external knowledge (QA-EK). In this paper we report the performance on a QA-EK task of a RAG-based
architecture whose generator is Llama 3.1, both in the Foundational and the Instruct version [5].
Figure 1: Schema of the implemented architecture for QA.


4.2. Experimental setup
We implemented a RAG-based architecture to perform QA-EK tasks on the UniQA data set, in order to
evaluate the quality of our data with respect to the correctness of the provided answers, which is also
related to the retrieval accuracy of the related documents. Such evaluations can be easily performed
since the golden answers are known. Finally, this type of architecture can be easily queried with domain-
related questions that are not in the UniQA data set: in this case, answers can be generated from the
retrieved documents, but evaluation is not trivial due to the lack of a corresponding golden answer.
   The implemented RAG-based architecture is illustrated in Figure 1, where two main components can
be distinguished: the retriever module and the generator LLM.
   Retriever module The retriever module is composed of a vector store and an Embeddings LLM. To
build it, we implemented a FAISS-based vector store [12] where the generated documents, from both
the train and the test split, were injected after being split into 1000-token chunks with 100 overlapping
tokens, using tiktoken5 as the tokenizer. The token chunks are then processed by an LLM tailored
for embedding generation (Embeddings LLM), with retrieval capabilities, that supports both English
and Italian. Accordingly, we selected the best models that meet our constraints on computational
resources from the Massive Text Embedding Benchmark (MTEB) [13]6, namely BGE-M3 (BGE) [14],
gte-Qwen2-7B-instruct (GTE) [15] and Multilingual-E5-large-instruct (m-E5) [16]. All
of them were trained on multilingual data, including English and Italian: actually, it is not so simple to
select models that explicitly state they were also trained on Italian data. As for the internal architecture,
all of them are built upon the Transformer encoder [17]; both BGE and m-E5 are smaller models of about
0.5B parameters, while GTE is a 7B one. One vector database for each model was built using the LangChain
framework7, and their retrieval performances are reported in Section 5. A minimal sketch of the
vector-store construction is reported below.
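The construction of one vector store can be sketched as follows; the langchain-community and langchain-huggingface package layout and the BGE-M3 model identifier are assumptions of this sketch.

    # Minimal sketch of the retriever module: 1000-token chunks with 100
    # overlapping tokens (tiktoken), embedded and indexed with FAISS.
    from langchain_text_splitters import CharacterTextSplitter
    from langchain_community.vectorstores import FAISS
    from langchain_huggingface import HuggingFaceEmbeddings

    splitter = CharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base", chunk_size=1000, chunk_overlap=100
    )
    chunks = splitter.create_documents(documents)  # documents: list[str]

    embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")
    vector_store = FAISS.from_documents(chunks, embeddings)

    # Retrieve the chunks most related to a query.
    retrieved = vector_store.similarity_search("Dove si svolgono le lezioni?", k=4)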
   Generator LLM We decided to stress the capabilities of the Llama-3.1 8B models [5], the latest decoder-
only generative models of the Llama family, which natively support both Italian and English
and are freely available. We tested both the Foundational and the Instruct model, providing two
different English prompts: Prompt 1 is designed as a standard instruction prompt and is suitable for
Foundational models, while Prompt 2 follows the instruction prompt suggested for both Instruction
tuning and inference by the authors of the Llama 3.1 models [5]. Both prompts are reported below.

5 https://github.com/openai/tiktoken
6 https://huggingface.co/spaces/mteb/leaderboard
7 https://www.langchain.com/langchain

       Prompt 1
          Below is an instruction that describes a task, paired with an input
          that provides further context. Write a response that appropriately
          completes the request.
          ### Instruction: You are Unipa-GPT, the chatbot and virtual assistant of the University
          of Palermo. Provide a cordially and colloquially answers to the questions provided. If you
          receive a greeting, answer by greeting and introducing yourself. If you receive a question
          concerning the University of Palermo, answer relying on the documents given to you with
          the question. If you do not know how to answer, apologize and suggest that you consult the
          university website [https://www.unipa.it/], do not invent answers. If the question is in English,
          answer in English. If the question is in Italian, answer in Italian.
          ### Input:
          Question: question
          Documents: context
          ### Response:

          Prompt 2
          <|start_header_id|>system<|end_header_id|>
          You are Unipa-GPT, the chatbot and virtual assistant of the University of Palermo. Provide a
          cordially and colloquially answers to the questions provided. If you receive a greeting, answer
          by greeting and introducing yourself. If you receive a question concerning the University
          of Palermo, answer relying on the documents given to you with the question. If you do
          not know how to answer, apologize and suggest that you consult the university website
          [https://www.unipa.it/], do not invent answers. If the question is in English, answer in English.
          If the question is in Italian, answer in Italian.
          <|eot_id|><|start_header_id|>user<|end_header_id|>
          Question: question
          Documents: context
          <|eot_id|><|start_header_id|>assistant<|end_header_id|>
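For reference, the sketch below assembles Prompt 2 and queries the generator through the Hugging Face transformers pipeline; the model identifier and the generation parameters are assumptions of this sketch.

    # Minimal sketch of querying the generator with Prompt 2.
    from transformers import pipeline

    generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

    SYSTEM = "You are Unipa-GPT, the chatbot and virtual assistant ..."  # full system text of Prompt 2

    def answer(question: str, context: str) -> str:
        prompt = (
            "<|start_header_id|>system<|end_header_id|>\n" + SYSTEM +
            "<|eot_id|><|start_header_id|>user<|end_header_id|>\n"
            f"Question: {question}\nDocuments: {context}"
            "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
        )
        out = generator(prompt, max_new_tokens=256, return_full_text=False)
        return out[0]["generated_text"]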

   Both prompts were used for querying the Foundational model, while Prompt 2 was used only for the
Instruct model. Despite the multilingual task, we opted for an English prompt since it is more
flexible in a real-world application, where the language of both the questions and the documents provided
as input is not known a priori. After evaluating the models in their base versions, we proceeded with a
3-epoch fine-tuning procedure over the best performing one, namely Llama-3.1-Instruct, using Prompt
2. Fine-tuning was performed with LoRA [18] following the Alpaca-LoRA hyper-parameters8, and we
will refer to this model as UniQA-3-ft. We trained the model using the prompt associated with the best
model, providing the golden documents as context. We developed the entire system on a server
with two Intel(R) Xeon(R) Gold 6442Y CPUs, 384 GB RAM, and two 48 GB NVIDIA RTX 6000 Ada
Generation GPUs. A minimal sketch of the fine-tuning setup is reported below.
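The LoRA setup can be sketched with the Hugging Face peft library as follows; the hyper-parameter values mirror the Alpaca-LoRA defaults, while the model identifier and the omission of the training loop are assumptions of this sketch.

    # Minimal sketch of the LoRA configuration; only the adapter weights
    # are updated during fine-tuning.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
    lora_config = LoraConfig(
        r=8,                                  # rank of the low-rank update
        lora_alpha=16,                        # scaling factor
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()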


5. Results
To evaluate the retrieval performance, we first performed a cluster analysis in the embedding space,
relying on the "native" clustering of the documents, which can be divided by language (Italian or English)
or by Department. The scatter plots of the embedding spaces for each Embeddings LLM are reported in
Figures 2 and 3. Such plots were obtained after a dimensionality reduction performed using t-SNE [19].
Along with the graphical visualization, we provide a quantitative analysis by computing the Silhouette
Coefficient [20], as reported in Table 10; a minimal sketch of this analysis is given after the figures.

8 https://github.com/tloen/alpaca-lora

Figure 2: Scatter plots of 2D reduced embedding spaces, labeled for each language.

Figure 3: Scatter plots of 2D reduced embedding spaces, labeled for each Department.
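A minimal sketch of the analysis with scikit-learn, assuming embeddings is an (n_docs, dim) array and labels holds the language or Department of each document; computing the Silhouette Coefficient on the full-dimensional embeddings is an assumption of the sketch.

    # Minimal sketch of the cluster analysis: t-SNE for the scatter plots,
    # Silhouette Coefficient for the S-Score in Table 10.
    from sklearn.manifold import TSNE
    from sklearn.metrics import silhouette_score

    coords_2d = TSNE(n_components=2).fit_transform(embeddings)
    s_score = silhouette_score(embeddings, labels)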
   Both the graphical and the analytical representations highlight that m-E5 has the best clustering
capabilities in language separation (Figure 2.c), while the other models tend to produce overlapping
embeddings. Conversely, the other models perform better at grouping documents semantically, that is,
by Department; in particular, GTE outperforms the others (Figure 3.b). Indeed, documents referring to
Computer and Mechanical Engineering degree courses, which are taught in the same Department, have
much more in common than those concerning Nursing or Law. Moreover, the Italian description of a
degree course contains many English terms, and this can make it harder to cluster documents based on
their native language.
   The retrieval performances of the models were evaluated by querying their vector stores with question
samples belonging to the test set9: if at least one of the retrieved documents matches the golden one
associated with the question, it is considered a correct retrieval. An accuracy measure was thus
computed, as reported in Table 10: here the superiority of GTE is confirmed with an accuracy of
almost 86%, while BGE reaches an accuracy just over 81%, and m-E5 attains just 77%. A minimal sketch
of this computation is reported below.

9 We used the reduced split in this experiment, just like in the ones devoted to QA evaluation.
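The accuracy measure can be sketched as follows; the doc_id metadata field and the number of retrieved chunks k are assumptions, not the exact UniQA setup.

    # Minimal sketch of the retrieval-accuracy measure: a query counts as a
    # correct retrieval if any retrieved chunk comes from its golden document.
    def retrieval_accuracy(test_set, vector_store, k=4):
        hits = 0
        for question, golden_doc_id in test_set:
            retrieved = vector_store.similarity_search(question, k=k)
            if any(doc.metadata.get("doc_id") == golden_doc_id for doc in retrieved):
                hits += 1
        return hits / len(test_set)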

Table 10
Clustering and retrieval performances of the three Embeddings LLMs that we tested. S-Score is used for clustering,
while retrieval performance is expressed in terms of accuracy. Best results are in bold.
     Retriever    S-Score language     S-Score departments      Exact matches      Total matches   Accuracy (Exact/Total)
       BGE               0.0057                -0.0883                2283              2805             81.3904%
       GTE              -0.0048                0.0470                 2408              2805             85.8467%
       m-E5             0.2237                 -0.1007                2159              2805             76.9697%


  QA evaluation was performed by querying the models using the reduced joint version of the test set
(Table 11), the English-only reduced split (Table 12), and the Italian-only one (Table 13). Performance
was measured in terms of BLEU [21] and ROUGE scores [22]. The inference ability of Llama 3.1 Foundational,
Llama 3.1 Instruct and UniQA-3-ft was assessed without any quantization strategy. Llama 3.1 Foundational
was queried using both Prompt 1 and Prompt 2, while Llama 3.1 Instruct and UniQA-3-ft were queried
using only Prompt 2, since they are Instruction fine-tuned models. The evaluation consisted of two
runs. In the first run, the golden context was provided in a one-shot scenario to the LLM without RAG,
while the second one made use of the GTE-based retriever module. The former run was aimed at evaluating
the model's inherent capability to generate a correct answer that adheres to the golden one, that is, a
paraphrase of the context. The latter run was aimed at evaluating the end-to-end performance of the
whole RAG architecture. We will refer to models that make use of RAG with the suffix retrieved. A
minimal sketch of the BLEU/ROUGE scoring step is reported below.
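The scoring step can be sketched with the Hugging Face evaluate package; predictions holds the generated answers and references the corresponding golden answers.

    # Minimal sketch of the answer scoring with BLEU and ROUGE.
    import evaluate

    bleu = evaluate.load("bleu")
    rouge = evaluate.load("rouge")  # rouge1, rouge2, rougeL, rougeLsum

    scores = {
        **bleu.compute(predictions=predictions, references=references),
        **rouge.compute(predictions=predictions, references=references),
    }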

      Prompt 3
      <|start_header_id|>system<|end_header_id|>
      You are Unipa-GPT, the chatbot and virtual assistant of the University of Palermo. Provide a
      cordially and colloquially answers to the questions provided. If you receive a greeting, answer
      by greeting and introducing yourself. If you do not know how to answer, apologize and suggest
      that you consult the university website [https://www.unipa.it/], do not invent answers. If the
      question is in English, answer in English. If the question is in Italian, answer in Italian.
      <|eot_id|><|start_header_id|>user<|end_header_id|>
      Question: question
      <|eot_id|><|start_header_id|>assistant<|end_header_id|>

   To provide a more comprehensive overview of our contribution, we also tested our UniQA-3-ft fine-tuned
model to assess its generation capabilities in a zero-shot scenario without RAG: we will refer to this
evaluation configuration as UniQA-3-ft no-RAG. A suitable version of Prompt 2, which we called Prompt 3
(reported above), was devised for this purpose; it does not contain any mention of relying on external
documents for answer generation. In Tables 11, 12 and 13 the best results for each run are in bold, while
italicised scores mark the cases where UniQA-3-ft no-RAG performed better than UniQA-3-ft retrieved.

    Table 11
    BLEU and ROUGE score for the different LLMs used for evaluation of the QAs-JOINT-reduced split.

           LLM               Prompt      BLEU      Rouge-1      Rouge-2      Rouge-L       Rouge-Lsum
        Llama 3.1               1        0.0043     0.0913       0.0167        0.0557         0.0648
        Llama 3.1               2        0.0035     0.0710       0.0132        0.0455         0.0544
      Llama 3.1 inst            2        0.0328     0.1636       0.0370        0.1017         0.1291
       UniQA-3-ft               2        0.2730     0.5390       0.3780        0.5217         0.5223
    Llama 3.1 retrieved         1        0.0043     0.0914       0.0167        0.0556         0.0647
    Llama 3.1 retrieved         2        0.0030     0.0332       0.0036        0.0265         0.0298
  Llama 3.1 inst retrieved      2        0.0070     0.1204       0.0250        0.0800         0.0992
   UniQA-3-ft retrieved         2        0.1646     0.3548       0.2075        0.3312         0.3332
    UniQA-3-ft no-RAG           3        0.1322      0.3777       0.2007       0.3288          0.3333

   As expected, UniQA-3-ft and UniQA-3-ft retrieved outperform the other models in their respective
runs, while the difference between their performances is not significant and mainly depends on
the quality of the retrieved documents. Llama 3.1 Foundational performs slightly better with Prompt 1
than with Prompt 2, and Llama 3.1 Instruct clearly shows its ability to follow instructions in both
settings.
   UniQA-3-ft no-RAG reaches performances comparable to UniQA-3-ft retrieved, and in some cases it
scores higher than the RAG version. This finding clearly indicates that UniQA is a high-quality, robust
data set that can be used to test both fine-tuned models and RAG architectures. It is worth noticing that the
    Table 12
    BLEU and ROUGE score for the different LLMs used for evaluation of the QAs-EN-reduced split.

           LLM               Prompt      BLEU      Rouge-1      Rouge-2      Rouge-L      Rouge-Lsum
        Llama 3.1               1       0.0016      0.1103       0.0230       0.0635         0.0744
        Llama 3.1               2       0.0017      0.0913       0.0168       0.0540         0.0664
      Llama 3.1 inst            2       0.0313      0.1445       0.0401       0.0945         0.1184
       UniQA-3-ft               2       0.2900      0.5813       0.4385       0.5641         0.5648
    Llama 3.1 retrieved         1       0.0016      0.1103       0.0230       0.0635         0.0744
    Llama 3.1 retrieved         2       0.0021      0.0442       0.0040       0.0346         0.0395
  Llama 3.1 inst retrieved      2       0.0064      0.1047       0.0252       0.0728         0.0888
   UniQA-3-ft retrieved         2       0.1555      0.3542       0.2169       0.3271         0.3304
    UniQA-3-ft no-RAG           3        0.1526      0.4191      0.2478       0.3687          0.3751

    Table 13
    BLEU and ROUGE score for the different LLMs used for evaluation of the QAs-IT-reduced split.

           LLM               Prompt      BLEU      Rouge-1      Rouge-2      Rouge-L      Rouge-Lsum
        Llama 3.1               1       0.0082      0.0677       0.0089       0.0459         0.0527
        Llama 3.1               2       0.0055      0.0455       0.0088       0.0347         0.0396
      Llama 3.1 inst            2       0.0349      0.1876       0.0331       0.1106         0.1425
       UniQA-3-ft               2       0.2433      0.4850       0.3029       0.4680         0.4687
    Llama 3.1 retrieved         1       0.0082      0.0677       0.0090       0.0459         0.0527
    Llama 3.1 retrieved         2       0.0036      0.0195       0.0032       0.0164         0.0178
  Llama 3.1 inst retrieved      2       0.0079      0.1402       0.0246       0.0890         0.1121
   UniQA-3-ft retrieved         2       0.1701      0.3543       0.1963       0.3357         0.3373
    UniQA-3-ft no-RAG           3        0.1023     0.3260       0.1421       0.2789          0.2817


structure of the train-test split guarantees that the answers provided by UniQA-3-ft no-RAG leverage
only the knowledge acquired during the fine-tuning phase. In fact, when answering a question
belonging to the test set, the model is completely unaware of the degree courses that are not in its
training set (Section 4.1). In this configuration, the inference capabilities of the model can be truly
tested, since it relies only on the knowledge acquired from the QA pairs of similar degree courses in the
same Department.
   Via a manual inspection of some of the generated answers, we found that both the non fine-tuned
and the fine-tuned models tend to output misspelled words, while both UniQA-3-ft no-RAG and the
non fine-tuned models provide incorrect answers, since they have no access to a complete knowledge of
the UniPA domain and thus reply leveraging their native and incomplete knowledge. The non fine-tuned
models tend to output verbose answers and to omit important information, wandering
off with a hypothetical course degree outline which is not required and may be imprecise. Generally
speaking, the non fine-tuned models may output some correct information, but in a different format from
the one provided in the golden answer, thus making it more difficult to evaluate the overall correctness
of the generated replies.
   Both the golden answers and the retrieved documents report a suggestion for the final user to
visit the website of the degree course to get more information: models try to generate links following
the structure of the ones provided in the retrieved documents and in the prompt. The non fine-tuned
models fail, since either no link is generated at all or the generated link does not refer to the UniPA
website. Fine-tuned models perform better, but not all the generated links are correct, since misspellings
are quite common.
6. Conclusions and future works
In this paper we presented UniQA, a high-quality QA data set in Italian and English suitable for
translation and question-answering tasks where external knowledge is required. UniQA is balanced
between the two languages, and it did not require any translation because it was scraped from the
original Italian and English web pages related to the degree courses issued at UniPA. UniQA counts
1048 documents and 13742 QA pairs generated in a semi-automated manner.
   We also tested a RAG-based architecture for QA with external knowledge whose generation
LLMs were Llama 3.1 Foundational and Llama 3.1 Instruct. Llama 3.1 was selected as a proof
of concept because it is recognized as a SOTA multilingual LLM, while both the fine-tuning and the
inference-only runs required a considerable amount of time on our local computational facilities. At
the time of submitting the manuscript, extensive tests are being run using both Foundational and
Instruct LLMs based on architectures other than Llama, as well as the best-known Italian adaptations
of such models.
   Future developments of this work are towards both extensive fine-tuning of the models under
investigation and end-to-end training of the whole RAG architecture, including the retriever. Finally,
a hybrid RAG architecture using both vector and graph databases is under development, to encode both
(vector) semantic similarity between documents and their closeness with respect to a domain ontology
implemented as a graph of semantic relations between the documents in the corpus.


References
 [1] G. Bonetta, C. D. Hromei, L. Siciliani, M. A. Stranisci, Preface to the Eighth Workshop on Natural
     Language for Artificial Intelligence (NL4AI), in: Proceedings of the Eighth Workshop on Natural
     Language for Artificial Intelligence (NL4AI 2024) co-located with the 23rd International Conference of
     the Italian Association for Artificial Intelligence (AI*IA 2024), 2024.
 [2] OpenAI, GPT-4 technical report, 2024. URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
 [3] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M.
     Dai, A. Hauth, et al., Gemini: a family of highly capable multimodal models, arXiv preprint
     arXiv:2312.11805 (2023).
 [4] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,
     T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances
     in Neural Information Processing Systems 33 (2020) 9459–9474.
 [5] Llama Team, AI @ Meta, The Llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783.
     arXiv:2407.21783.
 [6] D. Croce, A. Zelenanska, R. Basili, Neural learning for question answering in Italian, in: C. Ghidini,
     B. Magnini, A. Passerini, P. Traverso (Eds.), AI*IA 2018 – Advances in Artificial Intelligence,
     Springer International Publishing, Cham, 2018, pp. 389–402.
 [7] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension
     of text, 2016. URL: https://arxiv.org/abs/1606.05250. arXiv:1606.05250.
 [8] S. Menini, R. Sprugnoli, A. Uva, “Who was Pietro Badoglio?” Towards a QA system for Italian
     history, in: N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani,
     H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Tenth International Conference
     on Language Resources and Evaluation (LREC’16), European Language Resources Association
     (ELRA), Portorož, Slovenia, 2016, pp. 430–435. URL: https://aclanthology.org/L16-1069.
 [9] S. Ghanbari Haez, M. Segala, P. Bellan, S. Magnolini, L. Sanna, M. Consolandi, M. Dragoni, A
     retrieval-augmented generation strategy to enhance medical chatbot reliability, in: J. Finkelstein,
     R. Moskovitch, E. Parimbelli (Eds.), Artificial Intelligence in Medicine, Springer Nature Switzerland,
     Cham, 2024, pp. 213–223.
[10] T. Boccato, M. Ferrante, N. Toschi, Two-phase RAG-based chatbot for Italian funding application
     assistance, 2024.
[11] R. Figliè, T. Turchi, G. Baldi, D. Mazzei, Towards an LLM-based intelligent assistant for Industry 5.0,
     in: Proceedings of the 1st International Workshop on Designing and Building Hybrid Human–AI
     Systems (SYNERGY 2024), volume 3701, 2024. URL: https://ceur-ws.org/Vol-3701/paper7.pdf.
[12] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on
     Big Data 7 (2019) 535–547.
[13] N. Muennighoff, N. Tazi, L. Magne, N. Reimers, MTEB: Massive text embedding benchmark, arXiv
     preprint arXiv:2210.07316 (2022). URL: https://arxiv.org/abs/2210.07316. doi:10.48550/ARXIV.
     2210.07316.
[14] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, Z. Liu, BGE M3-embedding: Multi-lingual, multi-
     functionality, multi-granularity text embeddings through self-knowledge distillation, 2024.
     arXiv:2402.03216.
[15] Z. Li, X. Zhang, Y. Zhang, D. Long, P. Xie, M. Zhang, Towards general text embeddings with
     multi-stage contrastive learning, arXiv preprint arXiv:2308.03281 (2023).
[16] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, F. Wei, Multilingual e5 text embeddings: A
     technical report, arXiv preprint arXiv:2402.05672 (2024).
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
     Attention is all you need, Advances in neural information processing systems 30 (2017).
[18] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank
     adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021).
[19] L. Van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research
     9 (2008).
[20] P. J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,
     Journal of computational and applied mathematics 20 (1987) 53–65.
[21] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine
     translation, in: Proceedings of the 40th annual meeting of the Association for Computational
     Linguistics, 2002, pp. 311–318.
[22] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization
     Branches Out, 2004, pp. 74–81.