=Paper= {{Paper |id=Vol-3878/100_main_long |storemode=property |title=Unipa-GPT: A Framework to Assess Open-source Alternatives to Chat-GPT for Italian Chat-bots |pdfUrl=https://ceur-ws.org/Vol-3878/100_main_long.pdf |volume=Vol-3878 |authors=Irene Siragusa,Roberto Pirrone |dblpUrl=https://dblp.org/rec/conf/clic-it/SiragusaP24 }} ==Unipa-GPT: A Framework to Assess Open-source Alternatives to Chat-GPT for Italian Chat-bots== https://ceur-ws.org/Vol-3878/100_main_long.pdf
                                Unipa-GPT: a framework to assess open-source alternatives
                                to Chat-GPT for Italian chat-bots
                                Irene Siragusa1,2,∗ , Roberto Pirrone1
                                1
                                    Department of Engineering, University of Palermo, Palermo, 90128, Sicily, Italy
                                2
                                    Department of Computer Science, IT University of Copenhagen, København S, 2300, Denmark


                                                 Abstract
This paper illustrates the implementation of Open Unipa-GPT, an open-source version of the Unipa-GPT chat-bot that leverages
open-source Large Language Models for embeddings and text generation. The system relies on a Retrieval Augmented
Generation approach, thus mitigating hallucination errors in the generation phase. A detailed comparison between different
models is reported to illustrate their performance in embedding generation, retrieval, and text generation. In the
last case, models were tested both in a simple inference setup and after a fine-tuning procedure. Experiments demonstrate that
open-source LLMs can be efficiently used for embedding generation, but none of the models reaches the performance
obtained by closed models, such as gpt-3.5-turbo, in generating answers. Corpora and code are available on GitHub1

                                                 Keywords
                                                 RAG, ChatGPT, LLM, Embedding



1. Introduction

The increasing development of bigger and bigger Large Language Models (LLMs), reaching 70B parameters for the Meta models (Llama 2 [1] and Llama 3 [2]) and even more for the OpenAI ones (GPT-3 [3] and GPT-4 [4]¹), requires significant computational resources for training, fine-tuning, or inference. OpenAI models are accessible only upon payment via the OpenAI API and cannot be downloaded in any way, while the open-source models by Meta are also available in 8B and 13B parameter versions, and they can either be fine-tuned via Parameter-Efficient Fine-Tuning (PEFT) techniques [5] such as LoRA [6], or they can run direct inference with 8-bit quantization [7], keeping the computational requirements relatively small.

The availability of open-source small-size LLMs is crucial for developing Natural Language Processing (NLP) applications that leverage a fine-tuning procedure over a specific domain or language, as for Anita [8], an Italian 8B adaptation of Llama 3.

Nevertheless, GPT and Llama models cannot be considered truly open-source, since their training data sets are not available and, in the case of the GPT models, their actual architecture is not accessible either. The Minerva model [9], on the other side, is an Italian and English LLM whose architecture, weights, and training data are all accessible, and it can be considered an exception in the LLM landscape.

Starting from these premises, in this paper we propose Open Unipa-GPT, an open-source-based version of Unipa-GPT [10], a virtual assistant that uses a Retrieval Augmented Generation (RAG) approach [11] to answer university-related questions issued by secondary school students. Open Unipa-GPT has been developed upon the same architecture as Unipa-GPT, and uses open-source LLMs for embedding generation, retrieval, and text generation. Our models are small compared to the ones used in the original version, namely text-embedding-ada-002 and gpt-3.5-turbo from OpenAI.

The paper is arranged as follows: related works are reported in Section 2, while the architecture of Open Unipa-GPT is described in Section 3, and an overview of the data set is provided in Section 4. Experiments and related results are reported in Section 5. Finally, concluding remarks are drawn in Section 6.

2. Related works

The increasing interest in developing Language Models (LMs) for the Italian language started when BERT [12] was first released and adapted models, such as AlBERTo [13], were developed. After ChatGPT was made public [3, 4], an increasing interest in developing and using LLMs, and in generative AI based on decoder-only models, arose in the Italian NLP community as well, leading to the development of foundational models based on Llama 2 [1] and Llama 3 [2]. Among those models, LLaMantino (chat version) [14] and Fauno [15] are based on Llama 2 fine-tuned for chat purposes, while Camoscio [16] and Anita [8] are fine-tuned Italian versions of the instruct versions of Llama 2 and Llama 3, respectively.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
∗ Corresponding author.
irene.siragusa02@unipa.it (I. Siragusa); roberto.pirrone@unipa.it (R. Pirrone)
ORCID: 0009-0005-8434-8729 (I. Siragusa); 0000-0001-9453-510X (R. Pirrone)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ Online rumors refer to 175B and 1T parameters for gpt-3.5-turbo and GPT-4, respectively.
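The parameter saving behind LoRA, mentioned in the introduction, comes from factoring a weight update into two low-rank matrices so only those are trained. The following is a minimal pure-Python sketch of that idea (the 4096x4096 projection size and rank r=8 are illustrative assumptions, and the alpha/r scaling factor of the actual method is omitted):

```python
# LoRA sketch: a d_out x d_in weight update is factored as B (d_out x r)
# times A (r x d_in), so the trainable parameters drop from d_out*d_in
# to r*(d_out + d_in).
def lora_trainable_params(d_out, d_in, r):
    full = d_out * d_in          # parameters of a full fine-tune of one matrix
    lora = r * (d_out + d_in)    # parameters of the rank-r adapter
    return full, lora

def matmul(X, Y):
    # Plain triple-loop matrix product, enough for this sketch.
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

# Example: one 4096x4096 projection with rank r = 8 (illustrative sizes).
full, lora = lora_trainable_params(4096, 4096, 8)
print(f"full: {full}, lora: {lora} ({100 * lora / full:.2f}% of full)")

# The update delta_W = B @ A has rank <= r but the full d_out x d_in shape.
B = [[1.0], [0.0]]          # 2x1  (r = 1)
A = [[2.0, 3.0, 4.0]]       # 1x3
delta_W = matmul(B, A)      # a 2x3 update built from only 5 stored numbers
print(delta_W)
```

This is why rank-r adapters make fine-tuning 8B-scale models tractable: the trainable fraction per matrix shrinks to well under one percent.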




Figure 1: Overview of the Open Unipa-GPT architecture
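The flow in Figure 1 can be mimicked in a self-contained sketch, with a toy bag-of-words embedding standing in for the Embedding LLM and a brute-force similarity search standing in for the FAISS index (all names, the toy corpus, and the scoring function are illustrative assumptions, not the paper's implementation):

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy stand-in for the Embedding LLM: a bag-of-words vector.
    return Counter(w.strip(".,?!") for w in text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, corpus, k=4):
    # Rank corpus chunks by similarity to the question, as the Retriever does.
    q = embed(question)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(system_prompt, question, corpus):
    # The Generator receives the system prompt, retrieved context, and question.
    context = "\n".join(retrieve(question, corpus))
    return f"{system_prompt}\n\nContesto:\n{context}\n\nDomanda: {question}"

corpus = [
    "Le tasse universitarie si pagano tramite il portale studenti.",
    "Il corso di laurea in Informatica dura tre anni.",
    "La scadenza per le immatricolazioni e il 30 settembre.",
]
prompt = build_prompt("Sei Unipa-GPT.", "Quando scadono le immatricolazioni?", corpus)
print(prompt)
```

In the real system the `embed` and `retrieve` steps are replaced by an Embedding LLM and a FAISS vector store, and the assembled prompt is passed to the Generator LLM.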



RAG is used in developing chat-bots grounded in various domains, where the models need to be deeply guided in generation to avoid hallucinations in their answers. Various examples can be found in the educational domain, as for AI4LA [17], an assistant for students with Specific Learning Disorders (SLDs) like Dyslexia, Dysorthographia, and Dyscalculia, as an assistant providing information about the restaurant industry [18], or as a chat-bot for Frequently Asked Questions (FAQs) [19]. Chat-bots for the Italian language have also been implemented for real-world applications, namely as an assistant for Italian Funding Applications [20], in the medical domain [21], or in an industrial context [22]. The aforementioned works share the same architecture as the one we used to implement our model. In contrast with them, we decided to stress the capabilities of open-source LLMs and not to rely on GPT-based models, which are used only as a baseline reference for text generation (gpt-3.5-turbo) and as an external judge to evaluate the performances of the other models (gpt-4-turbo).

3. System architecture

Open Unipa-GPT relies on two main components, shown in Figure 1: the Retriever and the Generator. In the following, the two components are detailed.

3.1. Retriever

The Retriever is made up of a vector database built using the LangChain framework², which makes use of the Facebook AI Similarity Search (FAISS) library [23]. The vector database is filled with the documents belonging to the unipa-corpus (Appendix A), which are divided into 1K-token chunks with an overlap of 50 tokens. Split documents are then processed by an LLM (the Embedding LLM) to generate the corresponding embeddings, which are stored in the vector database. Different LLMs were used for embedding generation: we selected the best models according to the Massive Text Embedding Benchmark (MTEB) [24] for Information Retrieval³, considering only models that explicitly state that they were trained and tested also on Italian data. In the end, we selected the following models: BGE-M3 (BGE) [25], E5-mistral-7b-instruct (E5-mistral) [26], sentence-bert-base-italian-xxl-uncased⁴ (BERT-it), and Multilingual-E5-large-instruct (m-E5) [27].

A vector database was built for each model, and the corresponding embedding spaces were compared to each other and with text-embedding-ada-002, the embedding model from OpenAI, to assess their retrieval performances (Section 5).

3.2. Generator

The Generator uses the following Italian instruction prompt to answer user questions:

        Sei Unipa-GPT, chatbot e assistente virtuale dell'Università degli Studi di Palermo che risponde cordialmente e in forma colloquiale. Ai saluti, rispondi salutando e presentandoti. Ricordati che il rettore dell'Università è il professore Massimo Midiri. Se la domanda riguarda l'università degli studi di Palermo, rispondi in base alle informazioni e riporta i link ad esse associati; Se non

² https://www.langchain.com
³ as in https://huggingface.co/spaces/mteb/leaderboard in June 2024
⁴ https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased
sai rispondere alla domanda, rispondi dicendo che sei un'intelligenza artificiale che ha ancora molto da imparare e suggerisci di andare su https://www.unipa.it/, non inventare risposte.

Below is the English version:

        You are Unipa-GPT, a chatbot and virtual assistant of the University of Palermo, who responds cordially and in a colloquial manner. To greetings, answer by greeting and introducing yourself. Answer the question with the words "Answer: ". Remember that the rector of the University is Professor Massimo Midiri. If the question concerns the University of Palermo, answer on the basis of the information and provide the links associated with it; If you do not know how to answer the question, answer by saying that you are an artificial intelligence that still has a lot to learn and suggest going to https://www.unipa.it/, do not invent answers.

Both the question and the related relevant context are passed as input to the model, along with the prompt. As regards the Generator LLM, we used Transformer-based models [28]. We chose not to use LLMs based on Llama 2 and focused our work on the most recent models, covering both Llama- and Mistral-based architectures. In particular, Llama-3-8B-instruct [2] was used along with its adapted version for Italian, Anita-8B [8], and Minerva-3B [9], which is a Mistral-based architecture [29]. All the generation LLMs were evaluated both in their base version and in the instruction-tuned one. The latter were obtained via a three-epoch fine-tuning procedure with the Alpaca-LoRA [6] strategy, also testing the Alpaca-LoRA hyper-parameters⁵ for 20 and 50 epochs. In the generation phase, models were asked to output at most 256 tokens. We manually generated a small set of Question-Answer (QA) pairs for evaluation, starting from the real questions issued by the public during the 2023 SHARPER European Researchers' Night, where Unipa-GPT was demonstrated. The procedure for building these QA pairs is reported in Section 4. We developed the entire system on a server with 2 Intel(R) Xeon(R) 6248R CPUs, 384 GB RAM, and two 48 GB NVIDIA RTX 6000 Ada Generation GPUs.

4. The data set

The Italian document data set built for Unipa-GPT is called unipa-corpus [10], and it has been generated by scraping both HTML pages and PDF documents that are publicly available on the website of the University of Palermo. It includes information about all the available Bachelor/Master degree courses in the academic year 2023/2024, along with practical information for future students, e.g. how to pay taxes, the enrollment procedure, and the related deadlines. Starting from this data set, a QA data set was created with a semi-supervised procedure to allow instruction-tuning over general-purpose LLMs. Further information about the unipa-corpus is reported in Appendix A.

As already mentioned, the original Unipa-GPT was made available for public unsupervised QA during the European Researchers' Night in 2023, where a total of 165 questions were collected, along with user feedback. On average, an interaction with the chat-bot was two questions long, and we collected a qualitative evaluation of the user experience through a questionnaire people were requested to fill in online just after having chatted with Unipa-GPT. Questionnaires were further analyzed, and resulted in a generally positive evaluation of the system's performance by the majority of the users, who were mostly University students.

To generate the golden QA pairs used to assess the performance of each generator LLM, we devised six typologies, plus an Off-topic class, by direct inspection of the collected questions. In particular, we grouped questions into Generic Information, Courses' Information, Other University-related, Services and Structures, Taxes and Scholarships, University Environment, and Off-topic. Next, we picked one question per typology, discarding the Off-topic ones, and a golden answer was manually built for each of them by leveraging the actual relevant documents contained in the corpus, thus marking them as golden documents. Note that if an answer can be elicited from multiple documents, all of them have been marked as golden. The detailed list of the Italian QA pairs is reported in Appendix B in Table 4, while the English version is reported in Table 5. Note that the English version is reported for full readability purposes, while only Italian data were used for evaluation.

5. Experimental results

The proposed model is intended to work in an open QA context, where correct answers are not known; thus, after a previous phase of qualitative evaluation [10], as in [17, 20, 21, 22], we opted for a quantitative analysis, relying on the small QA data set described in Section 4 to evaluate the performances against a set of golden labels in terms of both retrieval and answering capabilities [30, 19, 18].

For each QA test pair, we retrieved the four most relevant documents from each vector database related to one of the open Embedding LLMs under investiga-

⁵ https://github.com/tloen/alpaca-lora
Table 1
Context Relevancy scores over different Embedding LLMs. Bold values refer to the most relevant documents selected by
RAGAS among the first four documents retrieved using the RAG. Underlined values refer to the golden documents.

                         Q1                                Q2                                Q3
  model        D1      D2      D3      D4      D1      D2      D3      D4      D1      D2      D3      D4
 open-ai-ada  0.1     0.1     0.0833  0.0714  0.0909  0.125   0.625   0.333   0.1     0.111   0.111   0.111
 e5-mistral   0.0217  0.0345  0.0345  0.0233  0.5     0.0526  0.0333  0.025   0.0345  0.0154  0.0185  0.0435
 bge          0.0345  0.0217  0.0385  0.0233  0.0526  0.312   0.0333  0.0909  0.0345  0.0667  0.0435  0.0154
 bert-it      0.125   0.125   0.0345  0.0345  0.25    0.333   0.125   0.143   0.172   0.125   0.143   0.0192
 m-e5         0.125   0.0217  0.1     0.0833  0.25    0.333   0.5     0.5     0.333   0.0185  0.037   0.0345

                         Q4                                Q5                                Q6
  model        D1      D2      D3      D4      D1      D2      D3      D4      D1      D2      D3      D4
 open-ai-ada  0.167   0.0588  0.1     0.333   0.429   0.5     0.143   0.111   1       0.1     0.167   0.5
 e5-mistral   0.0417  0.05    0.05    0.276   0.154   0.04    0.5     0.0667  0.333   0.111   0.333   0.111
 bge          0.241   0.0417  0.333   0.1     0.154   0.04    0.333   0.05    0.182   0.111   0.444   0.333
 bert-it      0.152   0.0303  0.152   0.0303  0.5     0.04    0.0385  0.0769  0.111   0.333   0.25    0.25
 m-e5         0.333   0.5     0.5     0.5     0.154   0.333   0.167   0.5     0.333   0.5     0.111   0.333



tion. Then we scored the retrieved documents in terms of their context relevancy with respect to the provided question, using the RAGAS framework [31], which exploits gpt-4-turbo for the evaluation. Results are reported in Table 1, and they also include the performances of the original vector database using OpenAI embeddings (text-embedding-ada-002, referred to as open-ai-ada). The overall scores are not very high, and the highest relevancy does not always correspond to the golden document used for generating the corresponding answer. In Table 1 the underlined values are the ones associated with golden documents, while the bold ones are the highest RAGAS values. A model is considered to perform correctly if the highest context relevancy score is assigned to one of the golden documents. This evaluation procedure led us to select E5-mistral as the best performing Embedding LLM among the ones we investigated.

The superior performance of E5-mistral is also confirmed by a deeper analysis of the embedding space by means of two different clustering procedures. We clustered the embeddings generated by each LLM starting from the documents belonging to the sections Educational Offer and Future Students of the UniPA website. The first group of documents is the list of all the available courses at the University, while the second group contains useful information for future students who want to enroll in a degree course. We clustered the embedding spaces according to either the degree course typology (bachelor/master degree) or the Department a degree course is affiliated to. Quantitative measures of the clustering goodness are reported in Table 2, where the Silhouette Coefficients [32] have been computed for each model, and again E5-mistral is the best performing one. In Appendix C, we report the scatter plots of the embedding spaces for each Embedding LLM (Figure 3 and Figure 4). Plots have been obtained through a 2D dimensionality reduction using t-SNE [33].

We used the six QA test pairs to also obtain a quantitative evaluation of the correctness of the answers provided by all the Generation LLMs under investigation. The comparison was carried out against both the golden answers and the ones generated via gpt-3.5-turbo (GPT) in the original Unipa-GPT setup. The proposed evaluation task can be regarded as an open QA one where, although a golden answer is provided for a given question, diverse correct answers can be proposed with different linguistic nuances, according to Italian diaphasic variation [34]. To evaluate both strict and loose correctness of the generated answers, we employed traditional QA metrics such as BLEU [35] (Figure 2.a) and the ROUGE-L score [36] (Figure 2.b), and novel metrics leveraging the RAGAS framework [31] to evaluate the Faithfulness (Figure 2.c) and Correctness (Figure 2.d) of the generated output. Such measures require an external LLM acting as a "judge", and we used gpt-4-turbo in this respect. More specifically, Faithfulness measures the factual consistency of the generated answer against the given context, while Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. Both metrics range from 0 to 1, and better performances are associated with higher scores.

Both BLEU and ROUGE scores are generally low, but we assume that this is mainly related to the fact that an exact match cannot be reached between the golden answer and the generated one, and a more semantic comparison should be taken into account. Overall, the answers generated by gpt-3.5-turbo can be considered the best ones, as they attain the highest values. By contrast, fine-tuning did not provide the desired improvement in the open-source models: all BLEU scores are almost zero, except for Anita-8B. ROUGE scores are higher than the corresponding BLEU ones, and again the base ver-
Table 2
Silhouette Coefficients for each Embedding LLM with reference to the two proposed clustering schemes, that is, the degree
course typology and the affiliation to a particular Department.

                         Retriever     Silhouette score typology     Silhouette score Departments
                        openAI-ada              -0.0915                         -0.0627
                         E5-mistral             -0.0194                         -0.0048
                            BGE                 -0.0422                         -0.0708
                          BERT-it               -0.0221                         -0.0367
                           m-E5                 -0.0982                         -0.0503
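The Silhouette Coefficient reported in Table 2 can be computed in a self-contained sketch following its standard definition, with a few toy 2-D points standing in for the t-SNE-reduced embeddings (the points and cluster labels are illustrative, not the paper's data):

```python
from math import dist

def silhouette_score(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points, where a is the mean
    intra-cluster distance and b is the lowest mean distance to another
    cluster. Singleton clusters get s(i) = 0 by convention."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q is not p]
        if not own:
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / len(own)
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for l2, members in clusters.items() if l2 != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated toy clusters give a score close to +1; scores near 0
# or below, as in Table 2, indicate overlapping clusters.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
labels = ["typology_A", "typology_A", "typology_B", "typology_B"]
print(round(silhouette_score(points, labels), 3))
```

The slightly negative values in Table 2 therefore mean that no embedding model separates the documents cleanly along either clustering scheme, although E5-mistral comes closest to zero.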




Figure 2: Inference results over the generated answers according to the following scores: (a) BLEU, (b) ROUGE, (c) Faithfulness,
and (d) Correctness. For display reasons, (a, b) are represented in a [0, 0.6] range, while (c, d) are in a [0, 1] range.



sion of each LLM performs better than the fine-tuned ones. Generally speaking, Anita-8B and Llama-3-8B-instruct outperform Minerva, since both reach comparable scores, and we assume that the tailored Italian fine-tuning of Llama-3 to obtain Anita-8B was crucial to make it the best performing open-source model in this first automatic evaluation phase.

gpt-3.5-turbo exhibits the best Faithfulness scores, despite being surpassed by Anita-8B on question Q2, and these results also confirm the previous considerations about the BLEU scores. Something changes when evaluating models in terms of their Correctness: in this case gpt-3.5-turbo is the best model in three answers out of six, followed by Anita-8B (two best results) and Minerva-3B-20 (one best result). We are aware that GPT-based evaluation may lead to a preference for GPT models themselves, but gpt-4-turbo was the only high-quality generative model we had access to at the time of the experiments.

Overall, the results confirm that a (moderate) fine-tuning is not significantly beneficial in terms of performance increase for any model and, even if it does not reach the same performances, Anita-8B seems to be the most valuable alternative to GPT.

A manual inspection of the generated answers outlines a common issue related to the tokenization of the generated output: despite its semantic correctness, the generated text is output as a single word without any spaces, as

        Glielezionidelcorsosaracondottoattraversounaprocesso

in Llama-3-8B-instruct, or it is over-split, as

        e-domandre d'i-s-c-r-i-z-ion-e-per-l-A.-A.–2023–/—-cor-so-n-d'-l-a-u-re-'-(M-ag-g-is-t-ra-le)–a-dd-ac-ce-o-lib-ro

in Anita-8B. These errors make the models not suitable for human interaction, since it is not possible to read the generated answers. We argue that a deeper analysis on
the tokenizer that has been used and a hyper-parameter tuning of the generator may lead to an increase in performance. Models also tend to answer in other languages, as

        * La durada édié depresso àdue años, * Accesso libre! * Dipartment of Physics & Chemistry "Emilo Segré" Codice course : 21915

in Llama-3-8B-instruct. We argue that this trouble can be related to the memory of multi-lingual models, which use texts also in French and Spanish despite the Italian fine-tuning. It is worth noticing that those languages are linguistically close to Italian and together belong to the Romance Languages [37]. Thus, even if the output has to be considered wrong, a linguistic connection can be highlighted.

The most unsatisfactory results are reported for Minerva-3B: the model does not generate any answer related to the given question, and it seems that answers were generated with samples from the model's training set. As stated before, a tuning of the generator hyper-parameters may help in this case.

Despite the promising results, in some cases the answers by both Anita-8B and Llama-3-8B-instruct are not good from a grammatical point of view, since they are full of mistakes, thus making them not yet ready to be used in real-world applications compared to OpenAI's ones.

6. Conclusions and future works

In this paper we presented Open Unipa-GPT, a virtual assistant, which is based solely on open-source LLMs, and uses a RAG approach to answer Italian university-

References

[1] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).
[2] Llama Team, AI @ Meta, The llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.
[3] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.
[4] OpenAI, Gpt-4 technical report, 2024. URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
[5] L. Xu, H. Xie, S.-Z. J. Qin, X. Tao, F. L. Wang, Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment, 2023. URL: https://arxiv.org/abs/2312.12148. arXiv:2312.12148.
[6] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021).
[7] T. Dettmers, M. Lewis, Y. Belkada, L. Zettlemoyer, Llm.int8(): 8-bit matrix multiplication for transformers at scale, 2022. URL: https://arxiv.org/abs/2208.07339. arXiv:2208.07339.
[8] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the italian language: Llamantino-3-anita, 2024. arXiv:2405.07101.
[9] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S. Conia, E. Barba, R. Navigli, Minerva technical report, 2024. URL: https://nlp.uniroma1.it/minerva/.
related questions from secondary school students. The         [10] I. Siragusa, R. Pirrone, Unipa-gpt: Large lan-
main intent of the presented research was setting up a             guage models for university-oriented qa in ital-
sort of framework to test open-source small size LLMs,             ian, 2024. URL: https://arxiv.org/abs/2407.14246.
with either moderate or no fine-tuning at all, to be used          arXiv:2407.14246 .
for generating the embeddings and/or as text generation       [11] P. Lewis, E. Perez, A. Piktus, F. Petroni,
front-end in a RAG set up.                                         V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t.
   Our study led us to devise E5-mistral-7b-instruct               Yih, T. Rocktäschel, et al., Retrieval-augmented
as a valuable open-source alternative to OpenAI’s em-              generation for knowledge-intensive nlp tasks,
beddings, while none of the considered models attain a             Advances in Neural Information Processing
generation performance comparable to gpt-3.5-turbo ,               Systems 33 (2020) 9459–9474.
even after a fine-tuning procedure. The most promis-          [12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT:
ing Generation LLM, when plunged in our architecture,              Pre-training of deep bidirectional transformers for
appears to be Anita-8B , but it still shows some issues            language understanding, in: J. Burstein, C. Do-
related to both the tokenization and the grammatical               ran, T. Solorio (Eds.), Proceedings of the 2019 Con-
correctness of the output. We are currently working                ference of the North American Chapter of the As-
to deep exploration of different fine-tuning approaches            sociation for Computational Linguistics: Human
along with the use of huge size open-source LLMs for               Language Technologies, Volume 1 (Long and Short
text generation.                                                   Papers), Association for Computational Linguistics,
                                                                   Minneapolis, Minnesota, 2019, pp. 4171–4186. URL:
      https://aclanthology.org/N19-1423. doi:10.18653/           [25] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian,
      v1/N19- 1423 .                                                  Z. Liu, Bge m3-embedding:              Multi-lingual,
[13] M. Polignano, P. Basile, M. Degemmis, G. Semeraro,               multi-functionality, multi-granularity text em-
     V. Basile, Alberto: Italian bert language under-                 beddings through self-knowledge distillation,
     standing model for nlp challenging tasks based on                2024. URL: https://arxiv.org/abs/2402.03216.
     tweets, in: Italian Conference on Computational                  arXiv:2402.03216 .
     Linguistics, 2019. URL: https://api.semanticscholar.        [26] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder,
     org/CorpusID:204914950.                                          F. Wei, Improving text embeddings with large lan-
[14] P. Basile, E. Musacchio, M. Polignano, L. Siciliani,             guage models, arXiv preprint arXiv:2401.00368
     G. Fiameni, G. Semeraro, Llamantino: Llama 2 mod-                (2023).
     els for effective text generation in italian language,      [27] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder,
     2023. arXiv:2312.09993 .                                         F. Wei, Multilingual e5 text embeddings: A techni-
[15] A. Bacciu, G. Trappolini, A. Santilli, E. Rodolà, F. Sil-        cal report, arXiv preprint arXiv:2402.05672 (2024).
     vestri, Fauno: The italian large language model             [28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
     that will leave you senza parole!, arXiv preprint                L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, At-
     arXiv:2306.14457 (2023).                                         tention is all you need, Advances in neural infor-
[16] A. Santilli, E. Rodolà, Camoscio: an italian                     mation processing systems 30 (2017).
     instruction-tuned llama, 2023. arXiv:2307.16456 .           [29] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bam-
[17] S. D’Urso, F. Sciarrone, Ai4la: An intelligent chat-             ford, D. S. Chaplot, D. de las Casas, F. Bressand,
     bot for supporting students with dyslexia, based on              G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-
     generative ai, in: A. Sifaleras, F. Lin (Eds.), Gen-             A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang,
     erative Intelligence and Intelligent Tutoring Sys-               T. Lacroix, W. E. Sayed, Mistral 7b, 2023. URL: https:
     tems, Springer Nature Switzerland, Cham, 2024, pp.               //arxiv.org/abs/2310.06825. arXiv:2310.06825 .
     369–377.                                                    [30] S. Vidivelli, M. Ramachandran, A. Dharunbal-
[18] V. Bhat, D. Sree, J. Cheerla, N. Mathew, G. LIu, J. Gao,         aji, Efficiency-driven custom chatbot develop-
     Retrieval augmented generation (rag) based restau-               ment: Unleashing langchain, rag, and performance-
     rant chatbot with ai testability, 2024.                          optimized llm fusion., Computers, Materials & Con-
[19] M. Kulkarni, P. Tangarajan, K. Kim, A. Trivedi, Re-              tinua 80 (2024).
     inforcement learning for optimizing rag for do-             [31] S. Es, J. James, L. Espinosa-Anke, S. Schockaert, Ra-
     main chatbots, 2024. URL: https://arxiv.org/abs/                 gas: Automated evaluation of retrieval augmented
     2401.06800. arXiv:2401.06800 .                                   generation, arXiv preprint arXiv:2309.15217 (2023).
[20] T. Boccato, M. Ferrante, N. Toschi, Two-phase rag-          [32] P. J. Rousseeuw, Silhouettes: a graphical aid to
     based chatbot for italian funding application assis-             the interpretation and validation of cluster analysis,
     tance, 2024.                                                     Journal of computational and applied mathematics
[21] S. Ghanbari Haez, M. Segala, P. Bellan, S. Mag-                  20 (1987) 53–65.
     nolini, L. Sanna, M. Consolandi, M. Dragoni, A              [33] L. Van der Maaten, G. Hinton, Visualizing data
     retrieval-augmented generation strategy to en-                   using t-sne., Journal of machine learning research
     hance medical chatbot reliability, in: J. Finkelstein,           9 (2008).
     R. Moskovitch, E. Parimbelli (Eds.), Artificial Intel-      [34] G. Berruto,          Variazione diafasica,       2011.
     ligence in Medicine, Springer Nature Switzerland,                URL:          https://www.treccani.it/enciclopedia/
     Cham, 2024, pp. 213–223.                                         variazione-diafasica_(Enciclopedia-dell'Italiano)/.
[22] R. Figliè, T. Turchi, G. Baldi, D. Mazzei, Towards an       [35] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a
     llm-based intelligent assistant for industry 5.0, in:            method for automatic evaluation of machine trans-
     Proceedings of the 1st International Workshop on                 lation, in: Proceedings of the 40th annual meeting
     Designing and Building Hybrid Human–AI Systems                   of the Association for Computational Linguistics,
     (SYNERGY 2024), volume 3701, 2024. URL: https:                   2002, pp. 311–318.
     //ceur-ws.org/Vol-3701/paper7.pdf.                          [36] C.-Y. Lin, Rouge: A package for automatic eval-
[23] J. Johnson, M. Douze, H. Jégou, Billion-scale simi-              uation of summaries, in: Text summarization
     larity search with GPUs, IEEE Transactions on Big                branches out, 2004, pp. 74–81.
     Data 7 (2019) 535–547.                                      [37] T. Alkire, C. Rosen, Romance languages: A histori-
[24] N. Muennighoff, N. Tazi, L. Magne, N. Reimers,                   cal introduction, Cambridge University Press, 2010.
     Mteb: Massive text embedding benchmark, arXiv
     preprint arXiv:2210.07316 (2022). URL: https:
     //arxiv.org/abs/2210.07316. doi:10.48550/ARXIV.
     2210.07316 .
A. unipa-corpus details

unipa-corpus [10] is a collection of Italian documents retrieved directly from the website of the University of Palermo in September 2023. The corpus is divided into two main sections, namely Education, which groups the available bachelor's and master's degree courses, and Future Students, which reports important information about tax payment and enrollment procedures. For fine-tuning purposes, a semi-automatic procedure involving gpt-3.5-turbo [3] was implemented to build a QA dataset. Table 3 reports the statistics of unipa-corpus .


Table 3
Number of documents and QA pairs in unipa-corpus .


                      Education      Future Students
   Documents             506               104
   Tokens              1072214            987424
   QA pairs (train)      506               269
   Tokens (train)       191612            68160
   QA pairs (val)        253               133
   Tokens (val)          93443            29675
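For illustration, per-section statistics such as those in Table 3 can be computed with a short script. This is a sketch under two assumptions not stated in the paper: documents are available as plain-text strings, and tokens are approximated by whitespace splitting (the paper does not specify which tokenizer produced the reported counts).

```python
def corpus_stats(sections):
    """Count documents and (whitespace-approximated) tokens per corpus section.

    sections: dict mapping a section name (e.g. "Education") to a list of
    document strings. Returns {section: {"documents": n, "tokens": m}}.
    """
    stats = {}
    for name, docs in sections.items():
        stats[name] = {
            "documents": len(docs),
            # Rough token count: whitespace-separated words, not a real tokenizer.
            "tokens": sum(len(d.split()) for d in docs),
        }
    return stats


# Hypothetical toy corpus mirroring the two sections of unipa-corpus
corpus = {
    "Education": [
        "Corso di Laurea in Ingegneria Informatica",
        "Corso di Laurea Magistrale in Chimica",
    ],
    "Future Students": [
        "Le tasse si pagano tramite PagoPA",
    ],
}
print(corpus_stats(corpus))
```

Swapping the whitespace split for the generator model's own tokenizer would reproduce counts on the scale reported in Table 3.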
B. Inference QA pairs

Table 4
Overview of the 6 manually generated QA pairs used for evaluation purposes.

Q1
Question: Chi è il professore di Intelligenza Artificiale 1 per il corso di Laurea Magistrale in Ingegneria Informatica?
Answer: Il professore di Intelligenza Artificiale 1 del corso di Laurea Magistrale in Ingegneria Informatica è il professore Gaglio e l'insegnamento verrà erogato durante il primo semestre. Per maggiori informazioni vai su http://www.unipa.it/struttura.html?id=721

Q2
Question: Quali sono le scadenze di iscrizione ad un corso di laurea?
Answer: La presentazione delle domande di iscrizione per l'Anno Accademico 2023/2024 varia in base alla tipologia di corso. Per i corsi di studio triennali e magistrali a ciclo unico, la domanda può essere presentata dall'1 agosto al 30 settembre 2023, mentre per i corsi magistrali, dall'1 agosto al 30 novembre 2023; in entrambi i casi, è richiesto il pagamento della prima rata delle tasse universitarie. Per i corsi di studio ad accesso programmato locale, con prenotazione online o ad accesso programmato nazionale, le immatricolazioni vanno effettuate entro i termini previsti dai relativi bandi di concorso.

Q3
Question: Come funziona la magistrale di chimica?
Answer: Il corso di laurea magistrale in Chimica presso l'Università degli Studi di Palermo ha sede a Palermo e ha una durata di 2 anni. L'obiettivo del corso è la formazione di laureati con una solida preparazione di base, che li ponga in grado di affrontare con competenza ed autonomia sia attività professionali sia quelle di ricerca accademica ed industriale. Per questo motivo il corso di studio è progettato con lo scopo di completare la cultura scientifica nelle discipline chimiche fondamentali e di introdurre conoscenze più avanzate. Quindi lo studente può completare il proprio piano di studi, attenzionando gli insegnamenti che meglio si adattano ai propri interessi specifici, come la chimica dei materiali, dell'ambiente e dei beni culturali, la chimica supramolecolare e la didattica della chimica. Per maggiori informazioni vai su http://www.unipa.it/struttura.html?id=766

Q4
Question: ciao! sono un ragazzo appena uscito dal liceo che è interessato al settore legale, in particolare alle leggi sulle aziende. dove potrei iscrivermi?
Answer: Ciao! Il corso di Laurea in Consulente Giuridico d'Impresa sembra proprio fare al caso tuo! Il corso di laurea ha sede a Trapani e ha una durata di 3 anni e ti fornirà le conoscenze in ambito giuridico ed economico-aziendalistico. Una volta terminato il percorso di studio potrai svolgere attività interdisciplinari che richiedono competenze giuridiche, aziendalistiche e organizzativo-gestionali. Per maggiori informazioni vai su http://www.unipa.it/struttura.html?id=1557

Q5
Question: come posso prenotare un appuntamento in segreteria?
Answer: È possibile recarsi in segreteria il lunedì, mercoledì e venerdì dalle 10.00 alle 12.00, martedì e giovedì dalle 15.00 alle 17.00. Puoi prenotare il tuo turno attraverso la App SolariQ. Per maggiori informazioni vai su https://www.unipa.it/servizi/segreterie/

Q6
Question: Come si pagano le tasse?
Answer: Il pagamento delle tasse deve essere effettuato esclusivamente mediante sistema PagoPA (Pagamenti della Pubblica Amministrazione). Dopo aver compilato la pratica online, è possibile pagare direttamente online con il sistema PagoPA o stampare il bollettino e pagare presso tabaccai convenzionati o ricevitorie abilitate PagoPA. Ulteriori informazioni sul pagamento via PagoPA sono reperibili qui https://immaweb.unipa.it/immaweb/public/pagamenti.seam, mentre è disponibile il Regolamento in materia di contribuzione studentesca https://www.unipa.it/servizi/segreterie/.content/documenti/regolamenti_calendari/2023/5105144-def_regolamento-contribuzione-studentesca-2023-24-2.pdf
Table 5
English version of Table 4.

Q1
Question: Who is the Artificial Intelligence 1 professor for the Computer Engineering Master degree course?
Answer: The Artificial Intelligence 1 professor for the Computer Engineering Master degree course is Professor Gaglio and it will be delivered during the first semester. For more information go to http://www.unipa.it/struttura.html?id=721

Q2
Question: What are the deadlines for enrolling in a degree programme?
Answer: The submission of applications for the Academic Year 2023/2024 varies according to the type of course. For three-year and single-cycle master's degree courses, applications can be submitted from 1 August to 30 September 2023, while for master's degree courses, from 1 August to 30 November 2023; in both cases, payment of the first instalment of tuition fees is required. For courses with local programmed access, with online booking or national programmed access, enrolment must be carried out by the deadlines set out in the corresponding calls for application.

Q3
Question: How does the master's degree in chemistry work?
Answer: The Master's degree course in Chemistry at the University of Palermo is based in Palermo and lasts 2 years. The aim of the course is to train graduates with a good background, enabling them to deal competently and independently with both professional activities and academic and industrial research. For this reason, the course is designed with the aim of completing the scientific culture in the fundamental chemical disciplines and introducing more advanced knowledge. Therefore, students can complete their study plan by focusing on the subjects that best suit their specific interests, such as the chemistry of materials, the environment and cultural heritage, supramolecular chemistry and the didactics of chemistry. For more information go to http://www.unipa.it/struttura.html?id=766

Q4
Question: hello! I'm a guy just out of high school who is interested in law, especially corporate law. where should i apply?
Answer: Hi! The Bachelor of Business Law Consultant programme sounds like it could be just the thing for you! The degree course is based in Trapani and lasts 3 years and will provide you with knowledge in the fields of law and business economics. Once you have completed the course you will be able to carry out interdisciplinary activities requiring legal, business and organisational-managerial skills. For more information go to http://www.unipa.it/struttura.html?id=1557

Q5
Question: how can i book an appointment at the secretariat?
Answer: You can go to the secretariat on Mondays, Wednesdays and Fridays from 10 a.m. to 12 noon, Tuesdays and Thursdays from 3 p.m. to 5 p.m. You can book your appointment through the SolariQ App. For more information go to https://www.unipa.it/servizi/segreterie/

Q6
Question: How do I pay fees?
Answer: Fees must be paid exclusively through the PagoPA (Public Administration Payments) system, which is accessed through the university portal. After completing the paperwork online, you can either pay directly online via the PagoPA system or print out the payment slip and pay at a PagoPA-enabled tax office. Further information on paying via PagoPA can be found here https://immaweb.unipa.it/immaweb/public/pagamenti.seam, while the Student Contribution Regulations are available here https://www.unipa.it/servizi/segreterie/.content/documents/regulations_calendars/2023/5105144-def_regulation-student-contribution-2023-24-2.pdf
C. Embedding spaces

Figure 3: Scatter plots of the embedding spaces, labeled by typology.

Figure 4: Scatter plots of the embedding spaces, labeled by department.
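Beyond visual inspection, the cluster separation visible in these scatter plots can be quantified with the silhouette coefficient [32]. The following is a minimal pure-Python sketch of that metric (the paper's actual computation is not detailed and would presumably rely on a library implementation such as scikit-learn's):

```python
from math import dist  # Euclidean distance between coordinate sequences


def silhouette(points, labels):
    """Mean silhouette coefficient over all points.

    points: list of coordinate tuples (e.g. 2-D projections of embeddings)
    labels: parallel list of cluster labels (e.g. typology or department)
    """
    def mean_dist(p, members):
        return sum(dist(p, q) for q in members) / len(members)

    # Group points by label.
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q != p]
        if not own:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = mean_dist(p, own)  # mean intra-cluster distance
        # Smallest mean distance to any other cluster.
        b = min(mean_dist(p, clusters[m]) for m in clusters if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Values close to 1 indicate well-separated clusters (embeddings of the same typology or department grouped together), while values near 0 or below suggest overlapping embedding regions.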