Unipa-GPT: a framework to assess open-source alternatives to ChatGPT for Italian chat-bots

Irene Siragusa 1,2,*, Roberto Pirrone 1
1 Department of Engineering, University of Palermo, Palermo, 90128, Sicily, Italy
2 Department of Computer Science, IT University of Copenhagen, København S, 2300, Denmark

Abstract
This paper illustrates the implementation of Open Unipa-GPT, an open-source version of the Unipa-GPT chat-bot that leverages open-source Large Language Models for embeddings and text generation. The system relies on a Retrieval Augmented Generation approach, thus mitigating hallucination errors in the generation phase. A detailed comparison between different models is reported to illustrate their performance as regards embedding generation, retrieval, and text generation. In the last case, models were tested in a simple inference setup after a fine-tuning procedure. Experiments demonstrate that open-source LLMs can be used efficiently for embedding generation, but none of the models reaches the performance obtained by closed models, such as gpt-3.5-turbo, in generating answers. Corpora and code are available on GitHub.

Keywords
RAG, ChatGPT, LLM, Embedding

1. Introduction

The increasing development of bigger and bigger Large Language Models (LLM), reaching 70B parameters for the Meta LLMs (Llama 2 [1] and Llama 3 [2]) and more for the OpenAI ones (GPT-3 [3] and GPT-4 [4] 1), requires significant computational resources for training, fine-tuning, or inference. OpenAI models are accessible only upon payment via the OpenAI API and cannot be downloaded in any way, while the open-source models by Meta are available also in 8B and 13B parameter versions, and they can either be fine-tuned via Parameter-Efficient Fine-Tuning (PEFT) techniques [5] such as LoRA [6], or they can run direct inference using 8-bit quantization [7], keeping the computational resources relatively small. The availability of open-source small-size LLMs is crucial for developing Natural Language Processing (NLP) applications that leverage a fine-tuning procedure over a specific domain or language, as for Anita [8], an Italian 8B adaptation of Llama 3.

Nevertheless, GPT and Llama models cannot be considered truly open-source, since their training data sets are not available and, as for the GPT models, their actual architecture is not accessible either. The Minerva [9] model, on the other side, is an Italian and English LLM whose architecture, weights, and training data are accessible, but it can be considered an exception in the LLM landscape.

Starting from these premises, in this paper we propose Open Unipa-GPT, an open-source-based version of Unipa-GPT [10], a virtual assistant that uses a Retrieval Augmented Generation (RAG) approach [11] to answer university-related questions issued by secondary school students. Open Unipa-GPT has been developed upon the same architecture as Unipa-GPT, and uses open-source LLMs for embedding generation, retrieval, and text generation. Our models are small compared to the ones used in our original version, namely text-embedding-ada-002 and gpt-3.5-turbo from OpenAI.

The paper is arranged as follows: related works are reported in Section 2, while the architecture of Open Unipa-GPT is described in Section 3, and an overview of the data set is provided in Section 4. Experiments and related results are reported in Section 5. Finally, concluding remarks are drawn in Section 6.

2. Related works

The increasing interest in developing Language Models (LM) for the Italian language started when BERT [12] was first released and adapted models, such as AlBERTo [13], were developed.
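The LoRA technique cited in the Introduction freezes the pre-trained weights and learns only a low-rank update, which is what keeps the fine-tuning budget small. The following is a minimal numerical sketch of that reparameterization on a toy linear layer (toy dimensions, plain numpy; this is not the Alpaca-LoRA code actually used in the paper):

```python
import numpy as np

# LoRA reparameterizes a frozen weight matrix W (d_out x d_in) with a
# low-rank update: W' = W + (alpha / r) * B @ A, where A (r x d_in) and
# B (d_out x r) are the only trainable parameters.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 8, 2, 4

W = rng.normal(size=(d_out, d_in))      # frozen pre-trained weights
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init -> W' = W at start

def forward(x, W, A, B, alpha, r):
    """Apply the adapted layer: (W + (alpha/r) * B A) @ x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialized, the adapter is a no-op before training.
assert np.allclose(forward(x, W, A, B, alpha, r), W @ x)

# Trainable parameters drop from d_out*d_in to r*(d_in + d_out).
full_params = d_out * d_in        # 128
lora_params = r * (d_in + d_out)  # 48
assert lora_params < full_params
```

After training, the update can be merged back into the base matrix (W + (alpha/r) * B @ A), so inference adds no latency; this is why LoRA pairs well with the 8-bit quantized inference mentioned above.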
After ChatGPT was made public [3, 4], an increasing interest in developing and using LLMs, and in generative AI based on decoder-only models, grew also in the Italian NLP community, thus leading to the development of foundational models based on Llama 2 [1] and Llama 3 [2]. Among those models, LLaMantino (chat version) [14] and Fauno [15] are based on Llama 2 fine-tuned for chat purposes, while Camoscio [16] and Anita [8] are fine-tuned Italian versions of the instruct versions of Llama 2 and Llama 3, respectively.

RAG is used in developing chat-bots that are grounded in various domains, where the models need to be deeply guided in generation to avoid hallucination in their answers. Various examples can be found in the educational domain, as for AI4LA [17], an assistant for students with Specific Learning Disorders (SLDs) like Dyslexia, Dysorthographia, and Dyscalculia, or as an assistant providing information about the restaurant industry [18], or as a chat-bot for Frequently Asked Questions (FAQ) [19]. Also chat-bots for the Italian language were implemented for real-world applications, namely as an assistant for Italian funding applications [20], in the medical domain [21], or in an industrial context [22]. The aforementioned works share the same architecture as the one we used to implement our model. In contrast with them, we decided to stress the capabilities of open-source LLMs and not to rely on GPT-based models, which are used as a baseline reference for text generation (gpt-3.5-turbo) and as an external judge to evaluate the performances of the other models (gpt-4-turbo).

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
irene.siragusa02@unipa.it (I. Siragusa); roberto.pirrone@unipa.it (R. Pirrone)
ORCID: 0009-0005-8434-8729 (I. Siragusa); 0000-0001-9453-510X (R. Pirrone)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org, ISSN 1613-0073)
1 Online rumors refer to 175B and 1T parameters for gpt-3.5-turbo and gpt-4, respectively.

Figure 1: Overview of the Open Unipa-GPT architecture

3. System architecture

Open Unipa-GPT relies on two main components, as shown in Figure 1: the Retriever and the Generator. In the following, the two components are detailed.

3.1. Retriever

The Retriever is made up of a vector database built using the LangChain framework 2, which makes use of the Facebook AI Similarity Search (FAISS) library [23]. The vector database is filled with the documents belonging to the unipa-corpus (Appendix A), which are divided into 1K-token chunks with an overlap of 50 tokens. Split documents are then processed by an LLM (the Embedding LLM) to generate the corresponding embeddings, which are stored in the vector database. Different LLMs were used for embedding generation: we selected the best models according to the Massive Text Embedding Benchmark (MTEB) [24] for Information Retrieval 3. We selected only models that explicitly state that they were trained and tested also with Italian data. In the end, we selected the following models: BGE-M3 (BGE) [25], E5-mistral-7b-instruct (E5-mistral) [26], sentence-bert-base-italian-xxl-uncased 4 (BERT-it), and Multilingual-E5-large-instruct (m-E5) [27].

A vector database was built for each model, and the corresponding embedding spaces were compared to each other and with text-embedding-ada-002, the embedding model from OpenAI, to assess their retrieval performances (Section 5).

2 https://www.langchain.com
3 As in https://huggingface.co/spaces/mteb/leader-board in June 2024
4 https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased

3.2. Generator

The Generator uses the following Italian instruction prompt to answer user questions:

Sei Unipa-GPT, chatbot e assistente virtuale dell'Università degli Studi di Palermo che risponde cordialmente e in forma colloquiale. Ai saluti, rispondi salutando e presentandoti. Ricordati che il rettore dell'Università è il professore Massimo Midiri. Se la domanda riguarda l'università degli studi di Palermo, rispondi in base alle informazioni e riporta i link ad esse associati; Se non sai rispondere alla domanda, rispondi dicendo che sei un'intelligenza artificiale che ha ancora molto da imparare e suggerisci di andare su https://www.unipa.it/, non inventare risposte.

Below, the English version:

I am Unipa-GPT, a chatbot and virtual assistant of the University of Palermo, who responds cordially and in a colloquial manner. To greetings, answer by greeting and introducing yourself; Answer the question with the words "Answer: ". Remember that the rector of the university is Professor Massimo Midiri. If the question concerns the University of Palermo, answer on the basis of the information and provide the links associated with it; If you do not know how to answer the question, answer by saying that you are an artificial intelligence that still has a lot to learn and suggest going to https://www.unipa.it/, do not invent answers.

Both the question and the related relevant context are passed as input to the model, along with the prompt. As regards the Generator LLM, we used Transformer-based models [28]. We chose not to use LLMs based on Llama 2 and focused our work on the most recent models, covering both Llama- and Mistral-based architectures. In particular, Llama-3-8B-instruct [2] was used along with its adapted version for Italian, Anita-8B [8], and Minerva-3B [9], which is a Mistral-based architecture [29]. All the generation LLMs were evaluated both in their base version and in the instruction-tuned one. The latter were obtained via a three-epoch fine-tuning procedure with the Alpaca-LoRA [6] strategy, testing the Alpaca-LoRA hyper-parameters 5 for both 20 and 50 epochs. In the generation phase, models were asked to output at most 256 tokens. We manually generated a small set of Question-Answer (QA) pairs for evaluation, starting from the real questions issued by the public during the 2023 SHARPER European Researchers' Night, where Unipa-GPT was demonstrated. The procedure for building these QA pairs is reported in Section 4. We developed the entire system on a server with 2 Intel(R) Xeon(R) 6248R CPUs, 384 GB RAM, and two 48 GB NVIDIA RTX 6000 Ada Generation GPUs.

5 https://github.com/tloen/alpaca-lora

4. The data set

The Italian documents data set built for Unipa-GPT is called unipa-corpus [10]; it has been generated by scraping either HTML pages or PDF documents that are publicly available on the website of the University of Palermo, and it includes information about all the available Bachelor/Master degree courses in the academic year 2023/2024, along with practical information for future students, e.g., how to pay taxes, the enrollment procedure, and the related deadlines. Starting from this data set, a QA data set was created with a semi-supervised procedure to allow instruction-tuning of general-purpose LLMs. Further information about the unipa-corpus is reported in Appendix A.

As already mentioned, the original Unipa-GPT was available for public unsupervised QA during the European Researchers' Night in 2023, where a total of 165 questions was collected, along with user feedback. On average, an interaction with the chat-bot was two questions long, and we collected a qualitative evaluation of the user experience through a suitable questionnaire that people were requested to fill in online just after having chatted with Unipa-GPT. The questionnaires were further analyzed, and resulted in a generally positive evaluation of the system's performances by the majority of the users, who were mostly University students.

To generate the golden QA pairs used to assess the different performances of each generator LLM, we devised six typologies by direct inspection of the collected questions. In particular, we grouped questions into Generic Information, Courses' Information, Other University-related, Services and Structures, Taxes and Scholarships, University Environment, and Off-topic. Next, we picked one question per typology, discarding the Off-topic ones, and a golden answer was manually built for each of them by leveraging the actual relevant documents contained in the corpus, thus marking them as golden documents. Note that if an answer can be elicited from multiple documents, all of them have been marked as golden. The detailed list of the Italian QA pairs is reported in Appendix B in Table 4, while the English version is reported in Table 5. Note that the English version is reported here for full readability purposes, while only Italian data were used for evaluation.

5. Experimental results

The proposed model is intended to work in an open QA context, where correct answers are not known; thus, after a previous phase of qualitative evaluation [10] as in [17, 20, 21, 22], we opted for a quantitative analysis, relying on the small QA data set described in Section 4 to evaluate the performances against a set of golden labels in terms of both retrieval and answering capabilities [30, 19, 18].

For each QA test pair, we retrieved the four most relevant documents from each vector database related to one of the open Embedding LLMs under investigation. Then we scored the retrieved documents in terms of their context relevancy with respect to the provided question using the RAGAS framework [31], which exploits gpt-4-turbo for the evaluation. Results are reported in Table 1, and they also include the performances of the original vector database using OpenAI embeddings (text-embedding-ada-002, referred to as open-ai-ada). We used the six QA test pairs to obtain also a quantitative evaluation of the correctness of the answers provided by all the Generation LLMs under investigation. Comparison was carried out against both the golden answers and the ones generated via gpt-3.5-turbo (GPT) in the original Unipa-GPT set up.

Table 1: Context Relevancy scores over different Embedding LLMs. Bold values refer to the most relevant documents selected by RAGAS among the first four documents retrieved using the RAG. Underlined values refer to the golden documents.

              Q1                               Q2                               Q3
model         D1      D2      D3      D4       D1      D2      D3      D4       D1      D2      D3      D4
open-ai-ada   0.1     0.1     0.0833  0.0714   0.0909  0.125   0.625   0.333    0.1     0.111   0.111   0.111
e5-mistral    0.0217  0.0345  0.0345  0.0233   0.5     0.0526  0.0333  0.025    0.0345  0.0154  0.0185  0.0435
bge           0.0345  0.0217  0.0385  0.0233   0.0526  0.312   0.0333  0.0909   0.0345  0.0667  0.0435  0.0154
bert-it       0.125   0.125   0.0345  0.0345   0.25    0.333   0.125   0.143    0.172   0.125   0.143   0.0192
m-e5          0.125   0.0217  0.1     0.0833   0.25    0.333   0.5     0.5      0.333   0.0185  0.037   0.0345

              Q4                               Q5                               Q6
model         D1      D2      D3      D4       D1      D2      D3      D4       D1      D2      D3      D4
open-ai-ada   0.167   0.0588  0.1     0.333    0.429   0.5     0.143   0.111    1       0.1     0.167   0.5
e5-mistral    0.0417  0.05    0.05    0.276    0.154   0.04    0.5     0.0667   0.333   0.111   0.333   0.111
bge           0.241   0.0417  0.333   0.1      0.154   0.04    0.333   0.05     0.182   0.111   0.444   0.333
bert-it       0.152   0.0303  0.152   0.0303   0.5     0.04    0.0385  0.0769   0.111   0.333   0.25    0.25
m-e5          0.333   0.5     0.5     0.5      0.154   0.333   0.167   0.5      0.333   0.5     0.111   0.333
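As a toy illustration of the retrieval step evaluated above (Section 3.1: 1K-token chunks with a 50-token overlap, top-4 nearest neighbours), the following sketch uses a bag-of-words embedding and cosine similarity; the actual system uses LangChain with FAISS [23] and a neural Embedding LLM, so the `embed` function here is purely an illustrative stand-in:

```python
import numpy as np

def chunk_tokens(tokens, size=1000, overlap=50):
    """Split a token list into chunks of `size` tokens, consecutive chunks sharing `overlap` tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def embed(tokens, vocab):
    """Toy embedding: L2-normalized term-frequency vector over a fixed vocabulary."""
    v = np.array([tokens.count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

def top_k(query_vec, doc_vecs, k=4):
    """Indices of the k most similar chunks by cosine similarity (vectors are normalized)."""
    sims = doc_vecs @ query_vec
    return np.argsort(-sims)[:k]

# Chunking: a ~2400-token document yields chunks of at most 1000 tokens,
# with each pair of consecutive chunks sharing exactly 50 tokens.
tokens = ("iscrizione tasse corso laurea " * 600).split()
chunks = chunk_tokens(tokens)
assert all(len(c) <= 1000 for c in chunks)
assert chunks[0][-50:] == chunks[1][:50]

# Retrieval: the chunk mentioning the query term ranks first.
docs = [["tasse", "pagamento"], ["laurea", "magistrale"], ["erasmus"], ["segreteria", "appuntamento"]]
vocab = sorted({w for d in docs for w in d})
doc_vecs = np.vstack([embed(d, vocab) for d in docs])
idx = top_k(embed(["tasse"], vocab), doc_vecs, k=2)
assert idx[0] == 0
```

FAISS performs the same nearest-neighbour search at scale over the dense vectors produced by the Embedding LLM; the chunk size and overlap are the only retriever hyper-parameters stated in the paper.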
The overall scores are not so high, and the highest relevancy does not always correspond to the golden document used for generating the corresponding answer. In Table 1, the underlined values are the ones associated with golden documents, while the bold ones are the highest RAGAS values. A model is considered to perform correctly if the highest context relevancy score is assigned to one of the golden documents. This evaluation procedure led us to select E5-mistral as the best performing Embedding LLM among the ones we investigated.

The superior performance of E5-mistral is also confirmed by a deeper analysis of the embedding spaces by means of two different clustering procedures. We clustered the embeddings generated by each LLM starting from the documents belonging to both the Educational Offer and Future Students sections of the UniPA website. The first group of documents is the list of all the available courses at the University, while the second group contains useful information for future students who want to enroll in a degree course. We clustered the embedding spaces according to either the degree course typology (bachelor/master degree) or the Department a degree course is affiliated to. Quantitative measures of the clustering goodness are reported in Table 2, where the Silhouette Coefficients [32] have been computed for each model, and again E5-mistral is the best performing one. In Appendix C, we report the scatter plots of the embedding spaces for each Embedding LLM (Figure 3 and Figure 4). Plots have been obtained through a 2D dimensionality reduction using t-SNE [33].

Table 2: Silhouette Coefficients for each Embedding LLM with reference to the two proposed clustering schemes, that is, the degree course typology and the affiliation to a particular Department.

Retriever     Silhouette score (typology)   Silhouette score (Departments)
openAI-ada    -0.0915                       -0.0627
E5-mistral    -0.0194                       -0.0048
BGE           -0.0422                       -0.0708
BERT-it       -0.0221                       -0.0367
m-E5          -0.0982                       -0.0503

The proposed evaluation task can be regarded as an open QA one where, although a golden answer is provided for a given question, diverse correct answers can be proposed with different linguistic nuances, according to Italian diaphasic variation [34]. To evaluate both strict and loose correctness of the generated answers, we employed traditional QA metrics such as BLEU [35] (Figure 2.a) and the ROUGE-L score [36] (Figure 2.b), and novel metrics leveraging the RAGAS framework [31] to evaluate the Faithfulness (Figure 2.c) and Correctness (Figure 2.d) of the generated output. Such measures require an external LLM acting as a "judge", and we used gpt-4-turbo in this respect. More specifically, Faithfulness measures the factual consistency of the generated answer against the given context, while Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. Both metrics range from 0 to 1, and better performances are associated with higher scores.

Figure 2: Inference results over the generated answers according to the following scores: (a) BLEU, (b) ROUGE, (c) Faithfulness, and (d) Correctness. For display reasons, (a, b) are represented in a [0, 0.6] range, while (c, d) in a [0, 1] range.

Both BLEU and ROUGE scores are generally low, but we assume that this is mainly related to the fact that an exact match cannot be reached between the golden answer and the generated one, and a more semantic comparison should be taken into account. Overall, the answers generated by gpt-3.5-turbo can be considered the best ones, as they attain the highest values. By contrast, fine-tuning did not provide the desired improvement in the open-source models: all BLEU scores are almost zero, except for Anita-8B. ROUGE scores are higher than the corresponding BLEU ones, and again the base version of each LLM performs better than the fine-tuned ones. Generally speaking, Anita-8B and Llama-3-8B-instruct outperform Minerva, since both reach comparable scores, but we assume that the tailored Italian fine-tuning over Llama-3 to obtain Anita-8B was crucial to make it the best performing open-source model during this first automatic evaluation phase.

gpt-3.5-turbo exhibits the best Faithfulness scores, despite being surpassed by Anita-8B on question Q2, and these results confirm the previous considerations about the BLEU scores. Something changes when evaluating models in terms of their Correctness: in this case gpt-3.5-turbo is the best model in three answers out of six, followed by Anita-8B (two best results) and Minerva-3B-20 (one best result). We are aware that GPT-based evaluation may lead to a preference towards GPT models themselves, but gpt-4-turbo was the only high-quality generative model we had access to at the time of the experiments.

Overall, the results confirm that a (moderate) fine-tuning is not significantly beneficial in terms of performance increase for any model and, even if it does not reach the same performances, Anita-8B seems to be the most valuable alternative to GPT.

A manual inspection of the generated answers outlines a common issue related to the tokenization of the generated output: despite its semantic correctness, the generated text is output as a unique word without any spaces, as

Glielezionidelcorsosaracondottoattraversounaprocesso

in Llama-3-8B-instruct, or it is over-split, as

e-domandre d'i-s-c-r-i-z-ion-e-per-l-A.-A.–2023–/—-cor-so-n-d'-l-a-u-re-'-(M-ag-g-is-t-ra-le)–a-dd-ac-ce-o-lib-ro

in Anita-8B. These errors make the models not suitable for human interaction, since it is not possible to read the generated answers. We argue that a deeper analysis of the tokenizer that has been used, and a hyper-parameter tuning of the generator, may lead to an increase in performances.
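The Silhouette Coefficients reported in Table 2 score how well each embedding sits inside its assigned cluster: s(i) = (b - a) / max(a, b), where a is the mean distance to points of the same cluster and b the mean distance to the nearest other cluster. A minimal sketch of the computation, on toy 2-D points standing in for the document embeddings:

```python
import numpy as np

def silhouette(points, labels):
    """Mean silhouette score over all points; ranges in [-1, 1], higher is better."""
    points, labels = np.asarray(points, float), np.asarray(labels)
    scores = []
    for i, (p, l) in enumerate(zip(points, labels)):
        same = [j for j in range(len(points)) if labels[j] == l and j != i]
        # a: mean intra-cluster distance
        a = np.mean([np.linalg.norm(p - points[j]) for j in same])
        # b: mean distance to the closest other cluster
        b = min(
            np.mean([np.linalg.norm(p - points[j])
                     for j in range(len(points)) if labels[j] == m])
            for m in set(labels.tolist()) if m != l
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

tight = [[0, 0], [0, 1], [10, 0], [10, 1]]
# Two well-separated clusters -> score close to +1.
assert silhouette(tight, [0, 0, 1, 1]) > 0.8
# Labels that cut across the real clusters -> negative score, as for the
# values reported in Table 2.
assert silhouette(tight, [0, 1, 0, 1]) < 0
```

In practice this is typically computed with `sklearn.metrics.silhouette_score`; the uniformly negative values in Table 2 indicate that neither clustering scheme is well separated in any of the embedding spaces, only less poorly so for E5-mistral.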
Models tend also to answer in other languages, as

* La durada édié depresso àdue años, * Accesso libre! * Dipartment of Physics & Chemistry "Emilo Segré" Codice course : 21915

in Llama-3-8B-instruct. We argue that this trouble can be related to the memory of multi-lingual models that use texts also in French and Spanish despite the Italian fine-tuning. It is worth noticing that those languages are linguistically close to Italian and together belong to the Romance Languages [37]. Thus, even if the output has to be considered wrong, a linguistic connection can be highlighted.

The most unsatisfactory results are reported for Minerva-3B: the model does not generate any answer related to the given question, and it seems that answers were generated with samples from the model's training set. As stated before, a tuning of the generator hyper-parameters may help in this case.

Despite the promising results, in some cases the answers by both Anita-8B and Llama-3-8B-instruct are not good from a grammatical point of view, since they are full of mistakes, thus making them not yet ready to be used in real-world applications compared to OpenAI's ones.

6. Conclusions and future works

In this paper we presented Open Unipa-GPT, a virtual assistant based solely on open-source LLMs, which uses a RAG approach to answer Italian university-related questions from secondary school students. The main intent of the presented research was setting up a sort of framework to test open-source small-size LLMs, with either moderate or no fine-tuning at all, to be used for generating the embeddings and/or as the text generation front-end in a RAG set up. Our study led us to identify E5-mistral-7b-instruct as a valuable open-source alternative to OpenAI's embeddings, while none of the considered models attains a generation performance comparable to gpt-3.5-turbo, even after a fine-tuning procedure. The most promising Generation LLM, when plugged into our architecture, appears to be Anita-8B, but it still shows some issues related to both the tokenization and the grammatical correctness of the output. We are currently working on a deeper exploration of different fine-tuning approaches, along with the use of larger open-source LLMs for text generation.

References

[1] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).
[2] Llama Team, AI @ Meta, The Llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.
[3] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[4] OpenAI, GPT-4 technical report, 2024. URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
[5] L. Xu, H. Xie, S.-Z. J. Qin, X. Tao, F. L. Wang, Parameter-efficient fine-tuning methods for pre-trained language models: A critical review and assessment, 2023. URL: https://arxiv.org/abs/2312.12148. arXiv:2312.12148.
[6] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021).
[7] T. Dettmers, M. Lewis, Y. Belkada, L. Zettlemoyer, LLM.int8(): 8-bit matrix multiplication for transformers at scale, 2022. URL: https://arxiv.org/abs/2208.07339. arXiv:2208.07339.
[8] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the Italian language: LLaMAntino-3-ANITA, 2024. arXiv:2405.07101.
[9] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S. Conia, E. Barba, R. Navigli, Minerva technical report, 2024. URL: https://nlp.uniroma1.it/minerva/.
[10] I. Siragusa, R. Pirrone, Unipa-GPT: Large language models for university-oriented QA in Italian, 2024. URL: https://arxiv.org/abs/2407.14246. arXiv:2407.14246.
[11] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.
[12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[13] M. Polignano, P. Basile, M. Degemmis, G. Semeraro, V. Basile, AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets, in: Italian Conference on Computational Linguistics, 2019. URL: https://api.semanticscholar.org/CorpusID:204914950.
[14] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, LLaMAntino: Llama 2 models for effective text generation in Italian language, 2023. arXiv:2312.09993.
[15] A. Bacciu, G. Trappolini, A. Santilli, E. Rodolà, F. Silvestri, Fauno: The Italian large language model that will leave you senza parole!, arXiv preprint arXiv:2306.14457 (2023).
[16] A. Santilli, E. Rodolà, Camoscio: an Italian instruction-tuned LLaMA, 2023. arXiv:2307.16456.
[17] S. D'Urso, F. Sciarrone, AI4LA: An intelligent chatbot for supporting students with dyslexia, based on generative AI, in: A. Sifaleras, F. Lin (Eds.), Generative Intelligence and Intelligent Tutoring Systems, Springer Nature Switzerland, Cham, 2024, pp. 369–377.
[18] V. Bhat, D. Sree, J. Cheerla, N. Mathew, G. Liu, J. Gao, Retrieval augmented generation (RAG) based restaurant chatbot with AI testability, 2024.
[19] M. Kulkarni, P. Tangarajan, K. Kim, A. Trivedi, Reinforcement learning for optimizing RAG for domain chatbots, 2024. URL: https://arxiv.org/abs/2401.06800. arXiv:2401.06800.
[20] T. Boccato, M. Ferrante, N. Toschi, Two-phase RAG-based chatbot for Italian funding application assistance, 2024.
[21] S. Ghanbari Haez, M. Segala, P. Bellan, S. Magnolini, L. Sanna, M. Consolandi, M. Dragoni, A retrieval-augmented generation strategy to enhance medical chatbot reliability, in: J. Finkelstein, R. Moskovitch, E. Parimbelli (Eds.), Artificial Intelligence in Medicine, Springer Nature Switzerland, Cham, 2024, pp. 213–223.
[22] R. Figliè, T. Turchi, G. Baldi, D. Mazzei, Towards an LLM-based intelligent assistant for industry 5.0, in: Proceedings of the 1st International Workshop on Designing and Building Hybrid Human–AI Systems (SYNERGY 2024), volume 3701, 2024. URL: https://ceur-ws.org/Vol-3701/paper7.pdf.
[23] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data 7 (2019) 535–547.
[24] N. Muennighoff, N. Tazi, L. Magne, N. Reimers, MTEB: Massive text embedding benchmark, arXiv preprint arXiv:2210.07316 (2022). URL: https://arxiv.org/abs/2210.07316. doi:10.48550/ARXIV.2210.07316.
[25] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, Z. Liu, BGE M3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024. URL: https://arxiv.org/abs/2402.03216. arXiv:2402.03216.
[26] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, F. Wei, Improving text embeddings with large language models, arXiv preprint arXiv:2401.00368 (2023).
[27] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, F. Wei, Multilingual E5 text embeddings: A technical report, arXiv preprint arXiv:2402.05672 (2024).
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[29] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.
[30] S. Vidivelli, M. Ramachandran, A. Dharunbalaji, Efficiency-driven custom chatbot development: Unleashing LangChain, RAG, and performance-optimized LLM fusion, Computers, Materials & Continua 80 (2024).
[31] S. Es, J. James, L. Espinosa-Anke, S. Schockaert, RAGAS: Automated evaluation of retrieval augmented generation, arXiv preprint arXiv:2309.15217 (2023).
[32] P. J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987) 53–65.
[33] L. Van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008).
[34] G. Berruto, Variazione diafasica, 2011. URL: https://www.treccani.it/enciclopedia/variazione-diafasica_(Enciclopedia-dell'Italiano)/.
[35] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[36] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, 2004, pp. 74–81.
[37] T. Alkire, C. Rosen, Romance languages: A historical introduction, Cambridge University Press, 2010.

A. unipa-corpus details

unipa-corpus [10] is a collection of Italian documents that were retrieved directly from the website of the University of Palermo in September 2023.
The corpus is divided into two main sections, namely Education, which groups the available bachelor and master degree courses, and Future Students, where important information about tax payment and the enrollment procedure is reported. For fine-tuning purposes, a semi-automatic procedure involving gpt-3.5-turbo [3] was implemented to build a QA dataset. Table 3 reports the statistics of unipa-corpus.

Table 3: Number of documents and QA pairs in unipa-corpus.

                  Education   Future Students
Documents         506         104
Tokens            1072214     987424
QA pairs train    506         269
Tokens train      191612      68160
QA pairs val      253         133
Tokens val        93443       29675

B. Inference QA pairs

Table 4: Overview of the 6 QA pairs manually generated for evaluation purposes.

Q1
Question: Chi è il professore di Intelligenza Artificiale 1 per il corso di Laurea Magistrale in Ingegneria Informatica?
Answer: Il professore di Intelligenza Artificiale 1 del corso di Laurea Magistrale in Ingegneria Informatica è il professore Gaglio e l'insegnamento verrà erogato durante il primo semestre. Per maggiori informazioni vai su http://www.unipa.it/struttura.html?id=721

Q2
Question: Quali sono le scadenze di iscrizione ad un corso di laurea?
Answer: La presentazione delle domande di iscrizione per l'Anno Accademico 2023/2024 varia in base alla tipologia di corso. Per i corsi di studio triennali e magistrali a ciclo unico, la domanda può essere presentata dall'1 agosto al 30 settembre 2023, mentre per i corsi magistrali, dall'1 agosto al 30 novembre 2023; in entrambi i casi, è richiesto il pagamento della prima rata delle tasse universitarie. Per i corsi di studio ad accesso programmato locale, con prenotazione online o ad accesso programmato nazionale, le immatricolazioni vanno effettuate entro i termini previsti dai relativi bandi di concorso.

Q3
Question: Come funziona la magistrale di chimica?
Answer: Il corso di laurea magistrale in Chimica presso l'Università degli Studi di Palermo ha sede a Palermo e ha una durata di 2 anni.
L’obiettivo del corso è la formazione di laureati con una solida preparazione di base, che li ponga in grado di affrontare con competenza ed autonomia sia attività professionali sia quelle di ricerca accademica ed industriale. Per questo motivo il corso di studio è progettato con lo scopo di Come funziona la magistrale di Q3 completare la cultura scientifica nelle discipline chimiche fondamentali chimica? e di introdurre conoscenze più avanzate. Quindi lo studente può completare il proprio piano di studi, attenzionando gli insegnamenti che meglio si adattano ai propri interessi specifici, come la chimica dei materiali, dell’ambiente e dei beni culturali, la chimica supramolecolare e la didattica della chimica. Per maggiori informazioni vai su http://www.unipa.it/struttura.html?id=766 Ciao! Il corso di Laurea in Consulente Giuridico d’Impresa sembra proprio fare al caso tuo! Il corso di laurea ha sede a Trapani e ha una ciao! sono un ragazzo appena uscito durata di 3 anni e ti fornirà le conoscenze in ambito giuridico ed dal liceo che è interessato al settore Q4 economico-aziendalistico. Una volta terminato il percorso di studio legale,in particolare alle leggi sulle potrai svolgere attività interdisciplinari che richiedono competenze aziende. dove potrei iscrivermi? giuridiche, aziendalistiche e organizzativo-gestionali. Per maggiori informazioni vai su http://www.unipa.it/struttura.htmlid=1557 È possibile recarsi in segreteria il lunedì, mercoledì e venerdì dalle 10.00 come posso prenotare un alle 12.00, martedì e giovedì dalle 15.00 alle 17.00 . Puoi prenotare il Q5 appuntamento in segreteria? tuo turno attraverso la App SolariQ. Per maggiori informazioni vai su https://www.unipa.it/servizi/segreterie/ Il pagamento delle tasse deve essere effettuato esclusivamente mediante sistema PAgoPA (Pagamenti della Pubblica Amministrazione). 
Dopo aver compilato la pratica online, è possibile pagare direttamente online con il sistema PAgoPA o stampare il bollettino e pagare presso tabaccai convenzionati o ricevitorie abilitate PAgoPA. Ulteriori informazioni sul Q6 Come si pagano le tasse? pagamento via PAgoPA sono reperibili qui https://immaweb.unipa.it/ immaweb/public/pagamenti.seam, mentre è disponibile il Regolamento in materia di contribuzione studentesca https://www.unipa.it/servizi/segreterie/ .content/documenti/regolamenti_calendari/2023/5105144- def_regolamento-contribuzione–studentesca-2023—24-2.pdf Table 5 English version of Table 4. IDs Questions Answers Who is the Artificial The Artificial Intelligence 1 professor for the Computer Engineering Intelligence 1 professor Master degree course is Professor Gaglio and it will be delivered Q1 for Computer Engineering during the first semester. For more information go to Master degree course? http://www.unipa.it/struttura.html?id=721 The submission of applications for the Academic Year 2023/2024 varies according to the type of course. For three-year and single-cycle master’s degree courses, applications can be submitted from 1 August What are the deadlines to 30 September 2023, while for master’s degree courses, from 1 August Q2 for enrolling in a to 30 November 2023; in both cases, payment of the first instalment of degree programme? tuition fees is required. For courses with local programmed access, with online booking or national programmed access, enrolment must be carried out by the deadlines set out in the corresponding calls for application. The Master’s degree course in Chemistry at the University of Palermo is based in Palermo and lasts 2 years. The aim of the course is to train graduates with a good background, enabling them to deal competently and independently with both professional activities and academic and industrial research. 
For this reason, the course is designed How does the master’s with the aim of completing the scientific culture in the fundamental Q3 degree in chemistry work? chemical disciplines and introducing more advanced knowledge. Therefore, students can complete their study plan by focusing on the subjects that best suit their specific interests, such as the chemistry of materials, the environment and cultural heritage, supramolecular chemistry and the didactics of chemistry. For more information go to http://www.unipa.it/struttura.html?id=766 Hi! The Bachelor of Business Law Consultant programme sounds like it could be just the thing for you! The degree course is based in Trapani hello! I’m a guy just out of high and lasts 3 years and will provide you with knowledge in the fields of school who is interested in law, Q4 law and business economics. Once you have completed the course you especially corporate law. will be able to carry out interdisciplinary activities requiring legal, where should i apply? business and organisational-managerial skills. For more information go to http://www.unipa.it/struttura.html?id=1557 You can go to the secretariat on Mondays, Wednesdays and Fridays from how can i book an appointment 10 a.m. to 12 noon, Tuesdays and Thursdays from 3 p.m. to 5 p.m. . Q5 at the secretariat? You can book your appointment through the SolariQ App. For more information go to https://www.unipa.it/servizi/segreterie/ Fees must be paid exclusively through the PAgoPA (Public Administration Payments) system, which is accessed through the university portal. After completing the paperwork online, you can either pay directly online via the PAgoPA system or print out the payment slip and pay at a Q6 How do I pay fees? PAgoPA-enabled tax office. Further information on paying via PAgoPA can be found here https://immaweb.unipa.it/immaweb/public/pagamenti. 
seam, while the Student Contribution Regulations is available here https://www.unipa.it/servizi/segreterie/.content/documents/regulations_ calendars/2023/5105144-def_regulation-student-contribution-2023-24-2.pdf C. Embedding spaces Figure 3: Scatter plots of embedding spaces labeled as for typology Figure 4: Scatter plots of embedding spaces labeled as for department
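Scatter plots like those in Figures 3 and 4 are obtained by projecting the document embeddings to two dimensions with t-SNE [33] and coloring each point by its label. A minimal sketch with scikit-learn follows; the embedding dimension, document counts, and label split are synthetic stand-ins, not the paper's actual data.

```python
# Sketch (not the authors' code): 2-D t-SNE projection of document
# embeddings, one point per document, for label-colored scatter plots.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real embeddings: 60 documents,
# 384-dimensional vectors, labeled by typology.
embeddings = rng.normal(size=(60, 384))
typology = ["Education"] * 40 + ["Future Students"] * 20

# Perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)
print(coords.shape)  # one (x, y) point per document
```

Plotting `coords` with, e.g., matplotlib's `scatter`, colored by `typology` (or by department), yields figures analogous to Figures 3 and 4.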