BERT-based question answering on closed domains: Preliminary Report

Stefano Bistarelli¹,*,†, Marco Cuccarini¹,²,*,†
¹ Department of Mathematics and Computer Science, University of Perugia, Via Luigi Vanvitelli 1, 06123 Perugia (PG), Italy
² Department of Biology, University of Naples Federico II, Via Cinthia 26c, 80126 Naples (NA), Italy

CILC 2024: 39th Italian Conference on Computational Logic, June 26-28, 2024, Rome, Italy
* Corresponding author.
† These authors contributed equally.
stefano.bistarelli@unipg.it (S. Bistarelli); marco.cuccarini@unina.it (M. Cuccarini)

Abstract
Natural language processing has seen a revolution in recent years thanks to Large Language Models (LLMs), which are based on generative technologies and set new standards for the field's main tasks (sentiment analysis, text classification, question answering, etc.). The main issue with current LLMs is hallucination, which means incomplete control over the model's output and can lead to disastrous outcomes in critical contexts. This makes it impractical to use LLMs in many contexts where a certain level of security and safety is required. We aim to develop a model that cannot hallucinate and reduces false replies, that is more efficient in terms of time than various generative models, and that provides the possibility to explain and identify errors (if any). This is done by avoiding LLMs based on text generation and instead using a model that selects the most relevant part of the text and, with an adequate reformulation of the sentence, provides the user with the required pieces of information. We use hotel policies and rules as a case study, but the proposed approach could be applied to all cases that involve questions about a given text. It is important to notice that this work does not require any type of fine-tuning or training on the particular data, making generalisation to other fields and contexts easy.

Keywords
Embedding, Question-answering, Knowledge representation, NLP, BERT

1. Introduction
Open and closed domains are the two main groups into which the question-answering problem can be divided. In open domains, users assume that the system will be able to answer any questions they may have about general knowledge; in a closed domain, on the other hand, users expect to get an answer specific to a particular source document. In this research, we present a closed-domain question-answering system that can explain a set of rules or documents to its users and that can be applied in various contexts (healthcare, legal, social, etc.).
Today, LLMs represent the state of the art in many applications, yielding good results on question-answering (QA) tasks as well. One advantage of using LLMs is that zero- or few-shot learning [1] can be applied, meaning that training can be done with a smaller number of samples than those needed for traditional fine-tuning. However, generative models bring with them many limits related to their unpredictability; a typical example is hallucinations [2], which can cause catastrophic damage in sensitive contexts like public relations, safety, or security. In the case of important information, like rules or policies, it is difficult to rely on this type of generative model, and it is safer to use approaches that give the possibility of controlling the produced text in a different way.
The system's objectives are to understand user questions, contextualise them, identify relevant information in the document, and provide the user with a response. All of this needs to be done to avoid spreading false or imprecise information.
In fact, the primary objective of our model is safety: to avoid the dissemination of false information to users, with possibly serious consequences when considering, for instance, laws and regulations. The final step will be to investigate the relationship between the answer and the query, as well as the key factors involved in selecting the answer from the document.
The paper is structured as follows: Section 2 is devoted to background, whereas Section 3 describes how we acquired all of the data and their characteristics. Section 4 describes the sentence splitting and the similarity evaluation of the text. In Section 5, we detail the evaluation of performance, and in Section 6, we describe the techniques used to explain the outcomes. Section 7 is devoted to the evaluation of the execution time and the analysis of all possible critical answers. In Section 8, we draw some conclusions and present options for overcoming the limitations encountered in the state-of-the-art techniques.

2. Background
In this work, we primarily leverage two key notions associated with Natural Language Processing (NLP): embedding and question answering. We briefly examine the current state of these two key concepts in the background of NLP.

2.1. Overview of embedding methods for texts
The goal of embedding is to transfer the linguistic information of a text or a word into a vector of numbers that can be measured. For embedding problems, the state of the art is defined principally by two types of models: Bidirectional Encoder Representations from Transformers (BERT) [3] and the Unified Pre-trained Language Model (UNILM) [4]. BERT uses a sequence of bidirectional encoder transformers [5], 12 for the base model and 24 for the large one, to encode a text. It considers the right and left contexts using a masked language modelling objective that masks 15% of the words and predicts them from their context. The training phase is broken into two parts: pre-training, which learns from a huge amount of unlabelled data using an unsupervised approach, and fine-tuning, which uses supervised learning to encode specific domains of data. UNILM is a multi-layer Transformer network pre-trained on large volumes of text. The unified LM is pre-trained for multiple language modelling objectives and shares the same parameters. UNILM, like BERT, can be fine-tuned to adapt to different downstream tasks by adding task-specific layers.

2.2. Overview of question answering in closed domains
In this problem, the model is trained to predict short answers. The model is pre-trained on language understanding. During the pre-training phase, the model employs the next sentence prediction objective, which trains the model to check for correlation between two sentences. In the state of the art, BERT is usually used for its ability to encode the semantics of a sentence into a vector of numbers, which is then used to relate the slice of text with the highest similarity to the user's query.
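To make this retrieval idea concrete, the following is a minimal sketch (not the authors' code) of how a QA-tuned Sentence-BERT model can rank document sentences against a query; the example sentences and query are invented, and the model name is the one adopted later in Section 4.2.

```python
# Minimal sketch: embed the document sentences and the query with a QA-tuned
# SBERT model, then return the sentence with the highest similarity score.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

sentences = [
    "Check-in is available from 2 pm.",
    "Check-out must be completed by 11 am.",
    "Pets are not allowed in the rooms.",
]
query = "Until what time can I check out?"

sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# This model is optimised for the dot product as similarity function.
scores = util.dot_score(query_embedding, sentence_embeddings)[0]
best_idx = int(scores.argmax())
print(sentences[best_idx])  # expected: the check-out sentence
```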
In other words, when the system receives a query from the user, it divides the document into sentences. Each of these sentences is ranked in terms of semantic similarity with respect to the question. Usually, the most similar sentence is used as the answer for the user. There are various techniques for similarity ranking and also for text division, but the structure of the various works is similar to the one described here. Fine-tuning, a supervised learning technique, can then be used to adapt the model to the topic's domain, making the answers to the requests more effective. In more detail, the parts of the models can be described as follows:
• Sentence splitting: The first step is to divide the document into different chunks of text; each chunk can be a simple sentence or an entire paragraph, with or without overlap or other characteristics. The choice of solution for this step is fundamental and will influence the future performance of the models.
• Sentence representation: The second step is to find a way to measure the semantic similarity between the chunks and the question. The solution most commonly adopted is embedding. The goal of the embedding is to create a vector that represents a sentence and the relationships between different sentences. If two sentences are similar, the two encoded vectors will be similar according to the evaluation functions.
• Evaluation of similarity: For the selection of the answer, it is necessary to use some measure of similarity between a sentence in the document and the query.
• Evaluation of the performance: It is possible to use a method similar to the ones cited in the previous step, or to perform a human check, or to use methods that evaluate how many words the two sentences share.

3. Data collection for hotel rules and policies
The first focus of the paper is the creation of a question-answering dataset. Using some ordinary web queries, we collected from the internet 48 samples of policies or rules of different hotels, each with its own length, layout, and structure; these samples come from real hotels located in different parts of the world (see Figure 1a).

Figure 1: Various distributions. (a) Distribution of policy documents based on nationality. (b) Top 5 answer distribution for BERT base.

Starting from those documents, we created a question-answer dataset using generative models. To produce question-answer pairs, ChatGPT 3.5¹ was employed. We asked the model to generate 20 questions for each of the 48 rules documents. For purposes of comparison, we also asked ChatGPT 3.5 to answer these questions. In 30% of cases (364 samples), ChatGPT created a double (359) or triple (5) sub-query. We decided to include the double or triple questions in the evaluation, considering a question answered correctly when one of its sub-queries is answered correctly. The dataset² consists of 960 question-answer pairs, 20 questions for each of the 48 collected documents.
It is important to note that while generating the questions, we assumed that all queries made by users would be contextual to the information within the text. This approach aimed to simplify the problem by disregarding questions unrelated to the document's content for now, leaving them for future work.

¹ https://chat.openai.com
² https://github.com/marcocuccarini/ChatBot-QA-Hotel-policies
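As an illustration of this dataset-creation step, the sketch below asks ChatGPT 3.5 for 20 questions about a single policy document. The prompt wording and the file name are assumptions (the paper does not report the exact procedure); the call uses the openai Python client (v1 interface).

```python
# Hypothetical sketch of the question-generation step for one hotel document.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

policy_text = open("hotel_policy_01.txt").read()  # hypothetical file name

prompt = (
    "Generate 20 questions that a guest could ask about the following "
    "hotel rules and policies, one per line:\n\n" + policy_text
)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
questions = response.choices[0].message.content.splitlines()
```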
4. Sentence splitting and similarity evaluation
The state-of-the-art approach in QA considers a sentence splitting phase and then a similarity evaluation (to map questions to answers).

4.1. Splitting the documents into sentences
The first task is sentence splitting for the document. The first approach was to use the function sent_tokenize() of the NLTK library³. The division into sentences was, however, too strict; the provided question was usually more articulate and involved more than one sentence. For this reason, we decided to divide the rules and policies not into sentences but into periods, using a static rule-based function. The periods are delimited by the dot character ("."); the limit of this solution is that dots are also used for other purposes, such as abbreviations (Dott., Mr., Mrs., etc.), e-mails, times, etc. We therefore created a function to split the text into periods, considering the exceptions of prefixes (Mr., St., Mrs., etc.), suffixes (Inc., Ltd., Jr., Sr., etc.), starters (Mr., Mrs., Ms., Dr., Prof., Capt., etc.), acronyms, websites, and e-mails. This improved the performance by making the answers more complete and better contextualised.

³ https://www.nltk.org
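The rule-based period splitter described above can be approximated as follows. This is a sketch, not the authors' exact implementation: the abbreviation list is illustrative and the dot-masking scheme is an assumption.

```python
# Sketch of a rule-based period splitter: split on full stops while
# protecting common abbreviations, e-mail addresses and web domains.
import re

ABBREVIATIONS = ["Mr.", "Mrs.", "Ms.", "Dr.", "Prof.", "Capt.", "St.",
                 "Inc.", "Ltd.", "Jr.", "Sr.", "Dott."]

def split_into_periods(text: str) -> list[str]:
    protected = text
    # Temporarily mask dots that do not end a period.
    for i, abbr in enumerate(ABBREVIATIONS):
        protected = protected.replace(abbr, abbr[:-1] + f"<DOT{i}>")
    # Mask dots inside e-mail addresses and web domains (info@hotel.com, www.hotel.com).
    protected = re.sub(r"(\w)\.(\w)", r"\1<DOT>\2", protected)
    # Split on the remaining full stops and restore the masked dots.
    periods = [p.strip() for p in protected.split(".") if p.strip()]
    return [re.sub(r"<DOT\d*>", ".", p) for p in periods]

print(split_into_periods("Contact Mr. Smith at info@hotel.com. Pets are not allowed."))
# -> ['Contact Mr. Smith at info@hotel.com', 'Pets are not allowed']
```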
4.2. Sentence similarity and evaluation of similarity
The embedding in this paper is used to evaluate the similarity between the question and the answer [6]. As said previously, the goal of embedding is to create a vector that translates the concept of semantic similarity into distance similarity. This implies that two sentences with similar meanings will have two vectors close to each other in the embedding space. We select as the answer the slice of text most similar (according to the embedding) to the question.
For encoding a sentence into an embedding vector, a valid approach is Siamese-BERT (SBERT) [7]. SBERT is a model based on BERT (Bidirectional Encoder Representations from Transformers) specifically designed to quantify the similarity between two sentences and express it through their vector representations. The structure is based on two BERT models, each of which transforms a sentence into a vector. As argued before, the power of the BERT model is the possibility of pre-training; the embedding can be produced for question answering, similarity correlation, and sentence classification. Different pre-trained models were trained on question-answering datasets. The process of pre-training also defines the type of pooling and the evaluation function. In the context of QA for SBERT, there are two models trained to associate questions and answers with high similarity:
• "multi-qa-mpnet-base-dot-v1" [8]: It has been trained with 215 million (question, answer) tuples. The model accepts a maximum sequence length of 512 tokens, and for the similarity function it uses the dot product. For the pooling it uses [CLS] pooling, and the resulting vector has a size of 768. It is the biggest model for the SBERT architecture and has been optimised considering the dot product as the similarity function.
• "multi-qa-distilbert-cos-v1": This model is a variant of the previous one; the differences are the size (420 MB), the pooling, which uses the mean value, and the base model, which in this case is distilbert-base [9].
The two models are optimised considering the dot product as the similarity function, and DistilBERT also the Euclidean distance, so we used the Euclidean distance for the DistilBERT model and the dot product for the base model, formally defined as:

\mathrm{euclidean\_distance}(\bar{X}, \bar{Y}) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}, \qquad \mathrm{dot\_prod}(\bar{X}, \bar{Y}) = \sum_{i=1}^{d} x_i \cdot y_i \qquad (1)

For the similarity evaluation with the Euclidean distance, the goal is to minimise the value produced, while for the dot product the goal is to maximise it. The Euclidean distance can produce only positive values, while the dot product can also produce negative values. They are both strictly related; in fact, they can be considered inversely proportional. We compare every sentence with a question, and we take as the answer the slice of text with the greatest similarity. The selected sentence is then provided as output to the user as the answer to the request.

5. Evaluation of performance
This part contains the main focus of the article: exploring the performance obtained and seeing how some features (length and quality of the document, ambiguity of the question, etc.) influence the results, in order to explain the decisions of the model. For the evaluation of the efficiency of the model, we decided to use a human check to avoid any type of approximation; all the answers have been labelled as "correct" or "wrong". An answer is correct when the requested information is present in it, and wrong when it is not. In the case of a question with multiple requests, when the model answers one of them correctly, the answer is labelled as correct. Later on, we also measured the performance by checking whether the interesting answer is present in the top 5 elements of the list of sentences ranked by similarity. This was done with the aim of explaining the errors of the model and exploring possible solutions to these limitations.
The results show the good performance of our model, achieving a correct answer rate of 0.815 in the case of the BERT base model; the DistilBERT model achieves satisfactory results with a correct answer rate of 0.762, lower than BERT base, as we expected. It is important to consider that the SBERT models used for producing the embeddings have not received any fine-tuning; this is to keep the generalisation of the system as a fundamental goal. The generalisation provides a lot of pros but also has some cons. The absence of fine-tuning does not permit us to specialise the model in a precise domain (in this case, tourism), which makes the model more easily confused by similar sentences. A model specialised on the documentation used for the test would recognise more easily the difference between two concepts (for example, check-in and check-out), which is more complex for the general model to recognise.

5.1. Estimation of the value of the error
To find possible explanations for the wrong answers of our model, we decided to examine the first five responses in addition to the first one, in order to comprehend how large an inaccuracy is. If the correct response is the second in the similarity rank, the error is not very significant; when it is absent from the top 5, this indicates a significant inaccuracy. So we selected all the answers predicted wrongly by the model and analysed whether the correct answer was present among the top most similar sentences found by the model. For this analysis, we consider the BERT base model and do not use the DistilBERT model. The results (see Fig. 1b) indicate that the second answer was correct 0.09 of the time, with lower probabilities for positions 3, 4, and 5. The case outside of the top five responses occurs 0.03 of the time. This information is interesting because it lets us reach 0.9 accuracy if we consider the first two possible answers, and if we consider the top 5, the accuracy reaches 0.96, a value comparable to the results produced by ChatGPT but with no possibility of hallucinations.
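The top-5 analysis above can be reproduced with a small helper. The data layout assumed here (an array of similarity scores per question plus the index of the human-labelled correct sentence) is our assumption, not the authors' format.

```python
# Sketch of the top-k analysis of Section 5.1 under an assumed data layout.
import numpy as np

def rank_of_correct(scores: np.ndarray, correct_idx: int) -> int:
    """1-based rank of the correct sentence when sentences are sorted
    by decreasing similarity score."""
    order = np.argsort(-scores)              # indices sorted by decreasing score
    return int(np.where(order == correct_idx)[0][0]) + 1

# Example: the correct sentence is only the second most similar one.
scores = np.array([0.61, 0.74, 0.33, 0.58])
print(rank_of_correct(scores, correct_idx=0))   # -> 2

def top_k_accuracy(examples, k=5):
    """examples: iterable of (scores, correct_idx) pairs."""
    hits = sum(rank_of_correct(s, i) <= k for s, i in examples)
    return hits / len(examples)
```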
6. Explainability of the error
In this phase of the work, we investigate what influences the results, looking at the relationship between the quality of the document in terms of embedding and the position of the questions. The idea is that if the embedding of a document is well distributed in space, it is less probable that the model will give wrong answers. This good distribution is also related to the position of the question: a question can lie in a well-distributed slice of the space of a generally badly distributed document, or, conversely, a well-spaced document can have a question in a not-well-spaced part of the space.
We decided to derive from the Sum of Squared Errors (SSE) the metric that we use to quantify the crowding of the points around a question. We define a radius around each question, and we sum the squared distances of all the points that fall within that radius. We defined this Crowding_level metric because the SSE cannot measure the neighbourhood of a question in a proper way: in the SSE approach, a point is considered to belong to only one centroid, so in cases where a portion of the space is not well spaced but contains a lot of questions, the measure turns out lower than expected. For this reason, we decided to consider the sum of the squared distances of all the points inside the radius r. In this way, the metric is strongly dependent on the number of sentences (the more points present in the space, the higher the metric), so we also decided to normalise by the number of sentences. Two principal factors influence the value of the metric:
• The number of samples: a question with fewer samples within its radius is deemed to be well spaced; therefore, this number should be as small as possible.
• The sum of all the squared distances: this value should be as large as possible, since it indicates that the points are widely separated from one another.

\mathrm{Crowding\_level}(doc) = \frac{1}{QN} \sum_{i=1}^{Q} \; \sum_{\|x - \mu_i\| < r} \|x - \mu_i\|^2, \qquad Q = \text{questions}, \; N = \text{sentences}. \qquad (2)

We used the Crowding_level metric to assign an error value to each document, in order to analyse the correlation between the accuracy in answering the questions and how the points are spaced in the embedding representation. To avoid any spurious correlation, we also considered the relation between the length of the document and the performance of the model. This analysis procedure has been applied to both BERT base and DistilBERT. For DistilBERT we used the Euclidean distance as the similarity function, and for BERT base we used the dot product.
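The Crowding_level metric of Eq. (2) can be written down directly. The NumPy sketch below is our interpretation of the formula (questions as centres μ_i, document sentences as points x), not the authors' code.

```python
# Sketch of the Crowding_level metric of Eq. (2): for each question embedding,
# sum the squared Euclidean distances to all sentence embeddings inside radius r,
# then normalise by the number of questions Q and sentences N.
import numpy as np

def crowding_level(question_embs: np.ndarray,   # shape (Q, d)
                   sentence_embs: np.ndarray,   # shape (N, d)
                   r: float) -> float:
    Q, N = len(question_embs), len(sentence_embs)
    total = 0.0
    for mu in question_embs:                            # mu = one question embedding
        dists = np.linalg.norm(sentence_embs - mu, axis=1)
        inside = dists < r                              # sentences inside the radius
        total += np.sum(dists[inside] ** 2)
    return total / (Q * N)
```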
The 2D visualisation of the embedding representation (which has a large dimensionality) is obtained through Principal Component Analysis (PCA), a technique built to reduce high-dimensional data for different purposes (relation extraction, visualisation, etc.) while maintaining the spatial relations between the various points. For the plots, DistilBERT is used, since it is also optimised for the Euclidean distance; this means that two points that are near in terms of space will also be near in terms of similarity. In the images we can see two documents with their sentences embedded according to the BERT base model: one is an example of a document with great performance and the other of a document with the worst performance.
These two samples show two opposite cases: one of the worst performances and one of the best (see Fig. 2). As we can see in the legend, the green points represent the sentences, while the crosses and the triangles mark the questions with wrong and correct answers, respectively. It is clear how the document with index 18 has a well-distributed embedding around the questions, whereas for the document with index 17 the embedded points are more crowded around the questions.

Figure 2: Embedding distribution produced by DistilBERT for a document with bad performance and one with good performance.

Table 1
Average time to produce a single token.
GPT-3.5 (OpenAI): 35 ms
GPT-3.5 (Azure OpenAI): 28 ms
GPT-4 (OpenAI): 94 ms
Llama-2-7B (Anyscale): 19 ms
Llama-2-70B (Anyscale): 46 ms
ChatBERT: 3.2 ms
ChatDistilBERT: 2.4 ms

Table 2
Some of the hallucinations of ChatGPT in the hotel policy framework.
Index 1
Question: How does the text advise guests to handle their valuables for added security?
Answer ChatGPT: The text does not specifically advise guests on how to handle their valuables for added security.
Answer BERT: Please keep your valuables in the special safes in your rooms.
Index 2
Question: Is there a specific age restriction for leaving children unattended in Hotel, and if so, what is it?
Answer ChatGPT: The text does not specify a specific age restriction for leaving children unattended.
Answer BERT: For safety reasons, it is not appropriate to leave children under 10 years of age without adult supervision in the room and other areas of Hotel.

7. Time of execution and critical answers
A fundamental aspect to consider is the result in terms of computation time. ChatGPT's time consumption is linear with respect to the number of words or tokens⁴. The results demonstrate significant disparities across the most popular models (see Table 1), which reports the average time required to produce a single word or token. We can notice that the generative models are very expensive in terms of computational cost; every token produced requires a large amount of computational power for its prediction, and this procedure needs to be repeated every time a new token is produced, because the new element changes the probability distribution of the words. A significant amount of computation is also needed for the embedding of the text, but only once per document. Once the embedded version of the text is created, it can be saved in a dataset, and the model only needs to encode the new questions produced by the user. This technique lets us handle large documents without a strong impact on computational efficiency.
Considering the wrong predictions of ChatGPT, i.e., its hallucinations (see Table 2), we can tell from the answers that the wrong answers are related to sensitive topics, where inaccurate information is critical. Moreover, in some cases, ChatGPT does not find any answer at all.

⁴ https://www.taivo.ai/__gpt-3-5-and-gpt-4-response-times/
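The caching strategy just described (embed the document once, encode only the incoming question) can be sketched as follows; the file name and helper functions are hypothetical, not part of the paper's code.

```python
# Sketch of precomputing and reusing document embeddings: the sentence
# embeddings are computed once and stored, so at query time only the
# user's question has to be encoded.
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

def build_index(sentences, path="doc_embeddings.npy"):
    embeddings = model.encode(sentences)           # done once per document
    np.save(path, embeddings)

def answer(query, sentences, path="doc_embeddings.npy"):
    embeddings = np.load(path)                     # reuse the stored embeddings
    query_emb = model.encode(query)                # only the query is encoded
    scores = util.dot_score(query_emb, embeddings)[0]
    return sentences[int(scores.argmax())]
```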
8. Conclusions and future works
We have shown the good performance reached by our model, with an accuracy of 81.5% when only the first answer is considered and of 96% when the top 5 sentences are considered; such results are similar to ChatGPT's best performance. In terms of execution time, our model outperforms the LLM results. Moreover, we avoid any hallucinations and unpredictability that LLMs can produce. With our approach, we can analyse which features of the space are involved in the question-answering process and how a well-spaced embedding of a document tends to produce better results. This gives us elements for the improvement of the system and the reduction of errors. On the other hand, for LLMs it is difficult to explain why errors occur, and it is also not possible to study which features are involved in the question and answer.
In this work, we only take into account questions that are relevant to the document, and it would be beneficial to develop a similarity criterion that determines whether a question is relevant or not. One way to solve this could be to specify a threshold for determining whether or not a question is contextualised, based on the greatest similarity between the query and the sentence that answers it. In order to provide more accurate responses, we can also take a closer look at the document's quality to see whether sentences that are not quite clear can be reworded. Additionally, by using embeddings, we can examine the features of the responses and the motivations behind the model's mistakes, all of which would be impossible with LLMs such as ChatGPT. We also plan to explore methods that combine the splitting of the text with the selection of the best-matching answers, not considering only a static division of the text but also the influence of the level of similarity with the question. Finally, we plan to apply this method to other datasets for a fair comparison with the state of the art, to automate the evaluation, and to avoid any bias in the process of judging the correctness of the answers.

9. Acknowledgment
The authors are members of the Gruppo Nazionale Calcolo Scientifico-Istituto Nazionale di Alta Matematica (GNCS-INdAM). This work has been partially supported by: GNCS-INdAM, CUP_E53C23001670001; GNCS-INdAM, CUP_E53C22001930001; European Union - Next Generation EU PNRR MUR PRIN - Project J53D23007220006 EPICA: "Empowering Public Interest Communication with Argumentation"; University of Perugia - Fondo Ricerca di Ateneo (2020, 2021, 2022) - Projects BLOCKCHAIN4FOODCHAIN, FICO, AIDMIX, "Civil Safety and Security for Society"; European Union - Next Generation EU NRRP-MUR - Project J97G22000170005 VITALITY: "Innovation, digitalisation and sustainability for the diffused economy in Central Italy"; Piano di Sviluppo e Coesione del Ministero della Salute 2014-2020 - Project I83C22001350001 LIFE: "the itaLian system Wide Frailty nEtwork" (Linea di azione 2.1 "Creazione di una rete nazionale per le malattie ad alto impatto" - Traiettoria 2 "E-Health, diagnostica avanzata, medical devices e mini invasività").

References
[1] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme (2022).
[2] H. Alkaissi, S. I. McFarlane, Artificial hallucinations in ChatGPT: implications in scientific writing, Cureus 15 (2023).
[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[4] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, H.-W. Hon, Unified language model pre-training for natural language understanding and generation, Advances in Neural Information Processing Systems 32 (2019).
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[6] J. Wang, Y. Dong, Measurement of text similarity: a survey, Information 11 (2020) 421.
[7] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, arXiv preprint arXiv:1908.10084 (2019).
[8] multi-qa-mpnet-base-dot-v1, https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1. Accessed: 2010-09-30.
[9] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).