1. Introduction

Eighth Workshop on Natural Language for Artificial Intelligence, November

UniQA: an Italian and English Question-Answering Data Set Based on Educational Documents

Irene Siragusa

0 1

Roberto Pirrone

1 0 Department of Computer Science, IT University of Copenhagen , København S, 2300 , Denmark 1 Department of Engineering, University of Palermo , Palermo, 90128, Sicily , Italy

2024

2 6 27

In this paper we introduce UniQA, a high-quality Question-Answering data set that comprehends more than 1k documents and nearly 14k QA pairs. UniQA has been generated in a semi-automated manner using the data retrieved from the website of the University of Palermo, covering information about the bachelor and master degree courses for the academic year 2024/2025. Data are both in Italian and English, thus making the data set suitable for QA and translation models. To assess the data, we propose a Retrieval Augmented Generation model based on Llama-3.1-instruct. UniQA can be found at https://github.com/CHILab1/UniQA.

eol>Question Answering RAG Large Language Modell

1. Introduction 2. Related works

QA is a classical task in Natural Language Processing where a model is asked to answer to a question relying on a given context. Unfortunately, annotated QA data sets and specifically the Italian ones are not so common. A valuable example is SQuAD-it [ 6 ], derived by the English QA data set SQuAD [ 7 ], that collects more than 60k QA pairs obtained a via semi-automatic translation procedure. Generally speaking, data sets obtained via translations can be useful when large quantity of native data in Italian are not available but in general they do not have the quality of manually annotated ones. On the other side, QUANDHO [ 8 ] represents a closed-domain QA data set built from native Italian texts that collects 627 questions manually classified, thus reaching a high level of data quality but its size is moderate. In our work we want to fill this gap by creating a new QA data set with a considerably large number of manually generated prompts for both questions and answers, which rely on structured data in both Italian and English without using any translation procedure.

QA is faced using LLMs by means of RAG to reduce both hallucinations and out-of-topic answers. RAGbased applications [ 4 ] mainly present the same architectural structure, where a retrieval component, typically a vector store, is used to save and retrieve documents related with the input, and a LLM-based generator infers the answers according to a suited prompt strategy for the target application. Mostly of the applications involves English, data and they are suitable for developing chat-bots or QA systems. Interesting works in this field that use Italian involve applications whose main focus is building a virtual assistant to help users in diverse tasks such as retrieving information about pregnancy [ 9 ] or gaining suggestions about how to write an Italian Funding Application [ 10 ], or obtaining real-time data in a industrial context [11].

3. Data set description

To build UniQA, we started from a web scraping procedure using both Selenium2 and Beautiful Soup3, over the website of the University of Palermo4 thus collecting a total of 1048 documents containing information about the bachelor and master degree courses for the academic year 2024/2025. In Table 1 are reported the number of documents collected in both the Italian and English splits along with the total ones (JOINT). Both Italian and English documents are original ones, that were scraped from the corresponding pages of the UniPA website either in the Italian or the English version, thus no translation has been made from Italian to English to create the data set.

For each available course, two documents have been generated, namely Course info and Course outline that share an equal header, collecting general information about the course such as the type of degree, the Department of afiliation, and the access rules. In Course info are reported also the educational objectives, the professional opportunities, and the final examination rules for the specific course. Despite the University ofers a total of 190 bachelor and master degrees, we collected 262 document couples. Provided that a course can have multiple curricula, which difer from either some classes or the location where the course is held, it was necessary to build both documents for each of them, causing small overlapping and repetitions as for the Course info documents. Documents of the same type follow the same architecture, thus allowing a semi-automated information-extraction for the generation of the QA data set. In addition, we added to each document the following phrases "For more information visit the course website [link]", and "Per maggiori informazioni consulta il sito del corso [link]" according to the specific language split and reporting the link to the web-page of the course.

Particularly, ten diferent QA prompts were generated (five prompts for each language split) that are reported in Table 2 and 3, and refer to the common header shared by each Course info-Course

2https://www.selenium.dev/

3https://www.crummy.com/software/BeautifulSoup/bs4/doc/ 4https://ofertaformativa.unipa.it/ofweb/public/corso/ricercaSemplice.seam outline document couple. Moreover, six prompts (three prompts for each language split) were generated specifically for each Course info document, that are reported in Table 4 and 5, and four prompts (two prompts for each language split) for each Course outline document (see Table 6 and 7).

The bachelor degree in course name for the academic year 2024/2025 is a n-year course held in location at the department name. It is possible to choose among one of the following curriculum: curriculum list. It is a closed access course with number seats available. It is possible to obtain a double degree with affiliate university.

In the last group of prompts, the former asks for generic information about the list of classes held in a target year, while the latter requests for specific information about a target class. As a consequence, the number of generated QA pairs for each document depends on both the number of years of the bachelor or master course and on the number of classes themselves. The following phrases "For more information visit the course website [link]", and "Per maggiori informazioni vai su [link]" are further concatenated to each answer of the generated QA pairs in both the English and the Italian split.

We are aware about the limitations and redundancy of the generated data set with a small amount of manually annotated templates for questions and answers. Despite this, our focus was towards generating a data set suitable for fine-tuning a LLM for making it able to generate answers that are not exact frames of the documents they have been generated from, but a re-paraphrased version. At the end of the generation process, we collected a total of 13742 QA pairs, equally split in 6871 Italian pairs and 6871 English pairs, as in Table 8.

4. Experiments 4.1. Data split

To stress the scientific interest of the developed data set, we provided also a list train-test split of the data set that are interleaved with the language split. The resulting available splits are reported in Table 9. Starting from the Course info documents, we first selected all the unique bachelor and master degrees, so that courses with multiple curricula were counted once, thus a set of 190 courses was obtained. Then the courses were sub-grouped with respect to the Department they belong to, and for each sub-group, a random 80-20 split was done to generate train and test groups. This procedure was implemented to ensure that: • bachelor and master courses with multiple curricula are considered as a unique block, and then put together in either a train or test split; • courses belonging to the same Department are equally divided to prevent any bias on the trained models.

Due to computational constraints on the training procedure of the generation LLM in our RAG architecture, we created also a reduced split of the data set whose global input is less than 3000 tokens, and it is 16% smaller than the original one, thus providing a not so significant reduction in performance.

In all the splits we included the QA pairs as well as the original documents, thus allowing the data set to be suitable for a large variety of NLP tasks, such as translation and QA with support of external knowledge (QA-EK). In this paper we report the performance on a QA-EK task of a RAG-based architecture based on Llama 3.1 both in the Foundational and Instruct version [ 5 ].

4.2. Experimental setup

We implemented a RAG-based architecture to perform QA-EK tasks on the UniQA data set in order to evaluate the quality of our data with respect to the correctness of the provided answers that is also related to the retrieval accuracy of the related documents. Such evaluations can be easily performed since golden answers are known. Finally, this type of architecture, can be easily queried with domainrelated questions that are not in UniQA data set: in this case, answers can be generated from the retrieved documents, but evaluation can be trivial due to the lack of the corresponding golden answer.

The implemented RAG-based architecture is illustrated in Figure 1, where two main components can be distinguished: the retriever module and the generator LLM.

Retriever module The retriever module is composed by a vector store and an Embeddings LLM. To build it, we implemented a FAISS-based vector store [12] where the generated documents, both from the train and test split, were injected after being splitted in 1000 token chunks with 100 overlapping tokens, using tiktoken5 as the tokenizer. The token chunks are then processed by a LLM tailored for embedding generation (Embeddings LLM), with retrieval capabilities, that supports both English and Italian. Accordingly, we selected the best models that meet our constraints on computational resources from the Massive Text Embedding Benchmark (MTEB) [13]6 namely BGE-M3 (BGE) [14], gte-Qwen2-7B-instruct (GTE) [15] and Multilingual-E5-large-instruct (m-E5) [16]. All of them where trained on multilingual data, including English and Italian: actually, it is not so simple to select models that explicitly state that were trained also on Italian data. As the internal architecture, all of them are build upon Transformers encoder [17], and both BGE and m-E5 are small 5M models, while GTE is a 7B one. One vector database for each model was built using the LangChain framework7, and their retrieval performances are reported in Section 5.

Generator LLM We decided to stress the capabilities of Llama-3.1 8B models [ 5 ], the last decoderonly generative model of the Llama models family, that has a native support for Italian and English as well, and it is freely available. We tested both Foundational and Instruct models providing two diferent English prompts: Prompt 1 is designed as standard instruction-prompt, and it is suitable for Foundational models, while Prompt 2, follows the instruction prompt suggested for both Instruction tuning and inference by the authors of the Llama 3.1 models [ 5 ]. Prompts are reported below.

Prompt 1

5https://github.com/openai/tiktoken 6https://huggingface.co/spaces/mteb/leaderboard 7https://www.langchain.com/langchain

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: You are Unipa-GPT, the chatbot and virtual assistant of the University of Palermo. Provide a cordially and colloquially answers to the questions provided. If you receive a greeting, answer by greeting and introducing yourself. If you receive a question concerning the University of Palermo, answer relying on the documents given to you with the question. If you do not know how to answer, apologize and suggest that you consult the university website [https://www.unipa.it/], do not invent answers. If the question is in English, answer in English. If the question is in Italian, answer in Italian. ### Input: Question: question Documents: context ### Response: Prompt 2 You are Unipa-GPT, the chatbot and virtual assistant of the University of Palermo. Provide a cordially and colloquially answers to the questions provided. If you receive a greeting, answer by greeting and introducing yourself. If you receive a question concerning the University of Palermo, answer relying on the documents given to you with the question. If you do not know how to answer, apologize and suggest that you consult the university website [https://www.unipa.it/], do not invent answers. If the question is in English, answer in English. If the question is in Italian, answer in Italian. Question: question Documents: context

Both the prompts were used for querying Foundational models, while Prompt 2 was used only for Instruct models. Despite the multilingual task, we opted for an English prompt since it can be more lfexible in a real-wold application where the language of both the questions and the documents provided as inputs is not known a priori. After evaluating models in their base versions, we proceeded with a 3 epochs fine-tuning procedure over the best performing one, that is Llama-3.1-Instruct, using Prompt 2. Fine-tuning was performed with LoRA [18] following the Alpaca-LoRA hyper-parameters8 and we will refer to this model as UniQA-3-ft. We trained the model using the prompt associated with the best model, and providing the golden documents as context. We developed the entire system on a server with two Intel(R) Xeon(R) Gold 6442Y CPUs, 384 GB RAM, and two 48 GB NVIDIA RTX 6000 Ada Generation.

5. Results

To evaluate the retrieval performance, we performed at first a cluster analysis in the embedding space relying on the “native” clustering of the documents, that can be divided by language (Italian or English) or by department. The scatter plots of the embedding spaces for each Embeddings LLM are reported in Figure 2 and 3 are reported. Such plots were obtained after a dimensionality reduction performed using t-SNE [19]. Along with the graphical visualization, we provide an analytical analysis, calculating the Silhouette Coeficient [20], as in Table 10.

Both the graphical and the analytical representation highlight that m-E5 has the best clustering capabilities in language separation (Figure 2.c) while the other models tend to overlap the embeddings. Conversely, they perform better in grouping documents in a semantic way that is by Department. Particularly, GTE outperforms the other models (Figure 3.b). Indeed, documents referring to Computer and Mechanical Engineering degree courses, which are taught in the same Department have much more in common than the ones concerning Nursing or Law. Moreover, the Italian description of a degree course contains many English terms, and this can make it harder to cluster documents based on their native language.

The retrieval performances of the models were evaluated by querying their vector stores with question samples belonging to the test set9: if at least one of the retrieved document matches the golden one associated with the question, it was considered a correct retrieval. Thus, an accuracy measure was computed as it is reported in Table 10: here the superiority of GTE is confirmed with an accuracy of almost 86%, while BGE reaches an accuracy just over 81%, and m-E5 attains just 77%.

QA evaluation, was performed by querying the models using the reduced joint version of the test set (Table 11) the English only reduced split (Table 12), and the Italian only one (Table 13). Performance were measured in terms of BLEU [21] and Rouge score [22]. Inference ability in Llama 3.1 Foundational, Llama 9we used the reduced split in this experiment just like in the ones devoted to QA evaluation 3.1 Instruct and UniQA-3-ft was assessed without any quantization strategy. Llama 3.1 Foundational was queried using both Prompt 1 and Prompt 2, while Llama 3.1 Instruct and UniQA-3-ft were queried using only Prompt 2, since they are Instruction fine-tuned models. The evaluation consisted in two runs. In the first run, the golden context was provided in a one-shot scenario to the LLM without RAG, while the second one made use of GTE-based retriever module. The former run was aimed at evaluating the model inherent capabilities at generating a correct answer that adheres to the golden one, that is a re-paraphrase of the context. The latter run was aimed at evaluating end-to-end performances of the whole RAG architecture. We will refer to models that make use of RAG using the sufix retrieved.

Prompt 3 You are Unipa-GPT, the chatbot and virtual assistant of the University of Palermo. Provide a cordially and colloquially answers to the questions provided. If you receive a greeting, answer by greeting and introducing yourself. If you do not know how to answer, apologize and suggest that you consult the university website [https://www.unipa.it/], do not invent answers. If the question is in English, answer in English. If the question is in Italian, answer in Italian. Question: question

To provide a more comprehensive overview of our contribution, we tested our UniQA-3-ft fine-tuned model to asses its generation capabilities in a zero-shot scenario without RAG: we will refer to this evaluation configuration as UniQA-3-ft no-RAG. A suitable version of the Prompt 2 was devised for this purpose that we called Prompt 3, and does not contain any mention to rely on external documents for answer generation. In Tables 11, 12 and 13, best results for each run are in bold, while italicised score values have been used in the third run when UniQA-3-ft no-RAG performed better than UniQA-3-ft retrieved.

LLM

As it was expected, UniQA-3-ft and UniQA-3-ft retrieved outperform the other models in their respective runs, while their diference in performances is not so significant, and it mainly depends on the quality of the retrieved documents. Llama 3.1 Foundational performs a bit better using Prompt 2 with respect to Prompt 1, and Llama 3.1 Instruct shows clearly its ability to follow instructions in both settings.

UniQA-3-ft no-RAG reaches comparable performances to UniQA-3-ft retrieved, and in some it scores higher than the RAG version. This finding indicates clearly that UniQA is a high quality robust data set that can be used to test both fine-tuned models and RAG architectures. It is worth noticing that the

LLM structure of the train-test split guarantees that the answers provided by UniQA-3-ft no-RAG leverage only the knowledge acquired during the fine-tuning phase. In fact, when answering to a question belonging to the test set, the model is completely unaware on the degree courses that are not in its training set (Section 4.1). In this configuration, the inference capabilities of the model can be truly tested since it is relying on the acquired knowledge from the QA pairs of similar degree courses in the same Department.

Via a manual inspection of some of the generated answers, we found that both the non fine-tuned models and the fine-tuned ones, tend to output misspelled words, while both UniQA-3-ft no-RAG and the non fine-tuned models provide incorrect answers since they have no access to a complete knowledge of the UnPA domain, thus they reply leveraging their native and incmplete knowledge. The non fine-tuned models tend to output verbose answers, and not to provide important information, thus wandering of with a hypothetical course degree outline which is not required, and may be imprecise. Generally speaking, the non fine-tuned models may output some correct information, but in a diferent format as the one provided in the golden answer, thus making it more dificult to evaluate the overall correctness of the generated replies.

In both the golden answers and the retrieved documents a suggestion is reported for the final user to visit the website of the degree course to get more information: models try to generate links following the structure of the ones provided in the retrieved documents and in the prompt. The non fine-tuned models fail since either no link is generated at all or the generated link does not refer to the UniPA website. Fine-tuned models perform better, but not all the generated links are correct since misspellings are quite common.

6. Conclusions and future works

In this paper we presented UniQA, a high-quality QA data set in Italian and English suitable for translation and question-answering tasks where external knowledge is required. UniQA is a balanced data set among the two languages, and it didn’t require any translation because it was scraped from original Italian and English web pages of related to the degree courses issued at UniPA. UniQA counts 1048 documents and 13742 QA pairs generated in a semi-automated manner.

We also tested a RAG-based architecture for QA with external knowledge tasks whose generation LLMs were both Llama 3.1 Foundational and Llama 3.1 Instruct. Llama 3.1 was selected as a proof of concept because it is recognized as a SOTA multilingual LLM, while both the fine-tuning and the inference-only runs required a considerable amount of time on our local computational facilities. At the time of submitting the manuscript, extensive tests are being run using also both Foundation and Instruct LLMs that are based on diferent architectures than Llama as well as on the most known Italian adaptations of such models.

Future developments of this work are towards both extensive fine-tuning of the models under investigation and on end-to-end training of the whole RAG architecture including the retriever. Finally, a hybrid RAG architecture using both vector and graph databases is under development to encode both (vector) semantic similarity between documents and their closeness with respect to a domain ontology implemented as a graph of semantic relations between the documents in the corpus. [11] R. Figliè, T. Turchi, G. Baldi, D. Mazzei, Towards an llm-based intelligent assistant for industry 5.0, in: Proceedings of the 1st International Workshop on Designing and Building Hybrid Human–AI Systems (SYNERGY 2024), volume 3701, 2024. URL: https://ceur-ws.org/Vol-3701/paper7.pdf. [12] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on

Big Data 7 (2019) 535–547. [13] N. Muennighof, N. Tazi, L. Magne, N. Reimers, Mteb: Massive text embedding benchmark, arXiv preprint arXiv:2210.07316 (2022). URL: https://arxiv.org/abs/2210.07316. doi:10.48550/ARXIV. 2210.07316. [14] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, Z. Liu, Bge m3-embedding: Multi-lingual, multifunctionality, multi-granularity text embeddings through self-knowledge distillation, 2024. arXiv:2402.03216. [15] Z. Li, X. Zhang, Y. Zhang, D. Long, P. Xie, M. Zhang, Towards general text embeddings with multi-stage contrastive learning, arXiv preprint arXiv:2308.03281 (2023). [16] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, F. Wei, Multilingual e5 text embeddings: A technical report, arXiv preprint arXiv:2402.05672 (2024). [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,

Attention is all you need, Advances in neural information processing systems 30 (2017). [18] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021). [19] L. Van der Maaten, G. Hinton, Visualizing data using t-sne., Journal of machine learning research 9 (2008). [20] P. J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,

Journal of computational and applied mathematics 20 (1987) 53–65. [21] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318. [22] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74–81.

[1]

Bonetta ,

C. D.

Hromei ,

Siciliani ,

M. A.

Stranisci , Preface to the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI) , in: Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024 ) co-located with 23th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024 ), 2024 .

[2] OpenAI, Gpt-4 technical report , 2024 . URL: https://arxiv.org/abs/2303.08774. arXiv: 2303 . 08774 .

[3]

Team ,

Anil ,

Borgeaud ,

Wu , J.-B. Alayrac , J.

Yu , R.

Soricut , J.

Schalkwyk , A. M.

Dai , A.

Hauth , et al., Gemini: a family of highly capable multimodal models , arXiv preprint arXiv:2312.11805 ( 2023 ).

[4]

Lewis ,

Perez ,

Piktus ,

Petroni ,

Karpukhin ,

Goyal ,

Küttler ,

Lewis , W.-t. Yih,

Rocktäschel , et al., Retrieval-augmented generation for knowledge-intensive nlp tasks , Advances in Neural Information Processing Systems 33 ( 2020 ) 9459 - 9474 .

[5]

A. . M.

Llama Team , The llama 3 herd of models , 2024 . URL: https://arxiv.org/abs/2407.21783. arXiv: 2407 . 21783 .

[6]

Croce ,

Zelenanska ,

Basili , Neural learning for question answering in italian , in: C. Ghidini , B.

Magnini , A.

Passerini , P. Traverso (Eds.), AI*IA 2018 - Advances in Artificial Intelligence , Springer International Publishing, Cham, 2018 , pp. 389 - 402 .

[7]

Rajpurkar ,

Zhang ,

Lopyrev ,

Liang , Squad: 100 ,000+ questions for machine comprehension of text, 2016 . URL: https://arxiv.org/abs/1606.05250. arXiv: 1606 . 05250 .

[8]

Menini ,

Sprugnoli , A . Uva, “ who was pietro badoglio?” towards a QA system for Italian history , in: N. Calzolari , K.

Choukri , T.

Declerck , S.

Goggi , M.

Grobelnik , B.

Maegaard , J.

Mariani , H.

Mazo , A.

Moreno , J.

Odijk , S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16) , European Language Resources Association (ELRA) , Portorož, Slovenia, 2016 , pp. 430 - 435 . URL: https://aclanthology.org/L16-1069.

[9]

Ghanbari Haez ,

Segala ,

Bellan ,

Magnolini ,

Sanna ,

Consolandi ,

Dragoni , A retrieval-augmented generation strategy to enhance medical chatbot reliability , in: J. Finkelstein , R. Moskovitch , E. Parimbelli (Eds.), Artificial Intelligence in Medicine , Springer Nature Switzerland, Cham, 2024 , pp. 213 - 223 .

[10]

Boccato ,

Ferrante ,

Toschi , Two-phase rag-based chatbot for italian funding application assistance , 2024 .