Chatbots with Deep Learning Techniques: The Influence of Dataset Variation on User Experience

Deola Simone1,2[0000-0001-5531-6684], Ricardo A. Matamoros A.1,2[0000-0002-1957-2530], and Francesco Epifania2

1 Department of Computer Science, University of Milano-Bicocca, Milan, Italy
{s.deola1, r.matamorosaragon}@campus.unimib.it
2 Social Things srl, Milan, Italy
{simone.deola, ricardo.matamoros, francesco.epifania}@socialthingum.com

Abstract. The main purpose of this research is to find a correlation between changes to a chatbot's training dataset and their impact on the performance of the chatbot itself.

Keywords: Chatbot · Conversational system · Deep Learning · Human-Computer Interaction.

1 Background of the Problem

Nowadays chatbot technology is widely used in human-machine interaction [1], allowing people to interact with complex backend systems using natural language. There are many examples in the consumer market (e.g. Alexa, Siri, OK Google, etc.) [2] as well as in the corporate market (e.g. chatbots for customer service). In both cases the general goal of a chatbot is to communicate properly with the user, responding to each request coherently and in the most "natural" way possible. Therefore, chatbots need to classify each user request according to the specific use cases they are able to manage. The chatbot classifier is trained on a set of sample phrases, taken from previous usage or handcrafted, divided into groups called intents. The more these phrases resemble the phrases that users will actually input, the better the chatbot classifier will perform. Another characteristic that affects performance is the way these phrases are grouped into intents. Each intent must contain phrases that share a topic, in order to provide the chatbot pipeline with indications on how to handle each single request. In other words, an intent must contain phrases that the chatbot should answer in a similar way.

Due to the nature of user interaction with a chatbot, the initial setup of the dataset can become outdated during the lifetime of the chatbot, and this lack of updated settings degrades the chatbot's performance. It is therefore necessary to make changes to the chatbot to adapt it to new configurations. In order to maintain a healthy classifier, the possible changes concern the phrases used to train it, i.e. removing or adding training phrases, or the intent composition, i.e. changing the intent assigned to each phrase, always guided by the needs of the users. For instance, users may ask for new information that was not covered by the previous training; in this case a new intent must be added to the chatbot [3]. Other problems can be related to the responses of the chatbot, which can be vague for a subset of phrases belonging to some intents; in this case the intent can be split into multiple intents that cover the particular cases [4]. There can also be topics that are no longer relevant for the chatbot, so the corresponding intents, and their phrases, can be removed.

2 Problem Description

Nowadays, the only way to estimate the impact of changes to a chatbot's intents on its performance is to train the chatbot itself and test it. This procedure is expensive in terms of both time and economic costs.
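To make this baseline concrete, the following is a minimal, hypothetical sketch of the train-and-test cycle it describes, using a generic TF-IDF plus logistic regression classifier from scikit-learn; the intent names, phrases and model choice are illustrative assumptions and not the configuration actually used in this work.

# Hypothetical sketch of the costly baseline: every change to the intent
# configuration requires retraining the classifier and measuring it again.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.pipeline import make_pipeline

# Training phrases grouped into intents (toy English examples; the real
# datasets described in this work are in Italian).
train_intents = {
    "opening_hours": ["When are you open?", "What are your opening hours?"],
    "card_block": ["I lost my card", "Please block my credit card"],
    "balance": ["What is my account balance?", "How much money do I have?"],
}
X_train = [p for ps in train_intents.values() for p in ps]
y_train = [name for name, ps in train_intents.items() for _ in ps]

# Held-out user requests used to score this dataset configuration.
X_test = ["Are you open on Sunday?", "Block my card please", "Show my balance"]
y_test = ["opening_hours", "card_block", "balance"]

# A simple NLP intent classifier: TF-IDF features + logistic regression.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Precision, recall and F1-score attached to this dataset configuration.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, clf.predict(X_test), average="macro", zero_division=0)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

Every intent modification (adding, removing or regrouping phrases) requires repeating this whole cycle, which is what makes the evaluation costly at scale.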
The purpose of this research is to train a model that can predict the impact of intent modifications on chatbot performance without the need to train the chatbot. The modifications considered will be intent removal, addition and modification; intent modifications will in turn consist of phrase addition, removal and modification.

During the data collection phase, multiple chatbots will be trained, all starting from the same set of phrases. The differences from one training to another will be the assignment of each phrase to an intent and the selection of phrases. These changes will be made in order to simulate the possible configurations that the proposed intent modifications could generate. For instance, the chatbot could be trained starting from all the phrases and then removing one group of phrases belonging to an intent at a time; the same could be done for each proposed modification. Every time a chatbot is trained, its performance will be measured and assigned to the corresponding training dataset configuration. In this way, a new dataset is created, composed of the input training dataset configurations paired with the performance of the chatbot trained on each of them. From now on we will refer to this dataset as the Performances dataset.

Since the purpose of this research is to study the impact of these changes on the chatbot system in the most complete way, different chatbots and different parts of the chatbot will be trained. Since the intent composition mostly affects the classifier part of the chatbot, the first step will focus on it. This means that the chatbot will consist of a single NLP classifier [5], and the relevant performance measures will be precision, recall and F1-score. Different NLP classifiers will be explored. A second step will use a complete chatbot for training and evaluation, studying the impact of the dataset modifications on the overall chatbot system. In order to evaluate the performance of these more complex systems, different metrics can be used, for example dialogue efficiency in terms of matching type, dialogue quality metrics based on response type, and user satisfaction assessed through an open-ended request for feedback [6]. Each of these highlights different aspects of the chatbot.

Once the Performances dataset is created, standard prediction methods, such as regression or neural networks, will be applied in order to infer the performance of a chatbot [7] (a minimal sketch of this prediction step is given in Section 3). During all the steps described above, standard data manipulation techniques for natural language will be used to operate on the data (embeddings, PCA, lemmatization, etc.). This process will be performed on multiple chatbot intent datasets provided by the company. These datasets are composed of Italian phrases and cover different content areas (banking chatbots, FAQ chatbots for different client services, etc.).

3 Conclusions

The expected result is a model that, starting from any chatbot dataset configuration, can predict the performance of the chatbot itself without the need to train it. The resulting model should be computationally simpler than the original chatbot and accurate enough to remove the need for a training phase during evaluation.
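As an illustration of the prediction step described in Section 2, the fragment below builds a toy Performances dataset, encodes each dataset configuration with a few simple features (number of intents, total phrases, average phrases per intent), and fits a regressor that estimates the F1-score of a new configuration without retraining the chatbot. The feature set, the regressor and the recorded scores are assumptions made only for illustration, not the design that will actually be adopted.

# Hypothetical sketch: predict chatbot performance from the dataset
# configuration alone, using a previously collected Performances dataset.
# Features, regressor and the example records are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def config_features(intents):
    """Encode a dataset configuration as simple numeric features:
    number of intents, total phrases, and average phrases per intent."""
    n_intents = len(intents)
    n_phrases = sum(len(ps) for ps in intents.values())
    return [n_intents, n_phrases, n_phrases / max(n_intents, 1)]

# Toy Performances dataset: (configuration, measured F1) pairs obtained by
# actually training and evaluating the chatbot on each configuration.
performances = [
    ({"a": ["p1", "p2", "p3"], "b": ["p4", "p5"]},               0.71),
    ({"a": ["p1", "p2", "p3"], "b": ["p4", "p5"], "c": ["p6"]},  0.64),
    ({"a": ["p1", "p2"],       "b": ["p4", "p5", "p7", "p8"]},   0.78),
    ({"a": ["p1", "p2", "p3", "p9"], "b": ["p4"]},               0.69),
]
X = np.array([config_features(cfg) for cfg, _ in performances])
y = np.array([f1 for _, f1 in performances])

# Fit the performance predictor on the Performances dataset.
predictor = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Estimate the F1-score of a new configuration without training the chatbot.
new_config = {"a": ["p1", "p2", "p3"], "b": ["p4", "p5", "p7"]}
print("predicted F1:", predictor.predict([config_features(new_config)])[0])

In practice, richer representations of a configuration (e.g. phrase embeddings rather than simple counts) would be needed for the predictor to capture the effect of regrouping phrases among intents.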
This kind of model could support the creation and maintenance of a chatbot by providing users with an instrument that facilitates the assignment of intents to the training phrases and suggests which phrases will be useful in the training process. A possible future development is to use the model to automatically correct the initial intent assignment and provide the best dataset for chatbot training, saving time and computational costs.

References

1. Ciechanowski, L., Przegalinska, A., Magnuski, M., Gloor, P.: In the shades of the uncanny valley: An experimental study of human–chatbot interaction. Future Generation Computer Systems (2019)
2. Batish, R.: Voicebot and Chatbot Design: Flexible Conversational Interfaces with Amazon Alexa, Google Home, and Facebook Messenger. Packt Publishing Ltd (2018)
3. Ye, W., Li, Q.: Open Questions for Next Generation Chatbots. In: 2020 IEEE/ACM Symposium on Edge Computing (SEC), pp. 346-351 (2020). doi: 10.1109/SEC50012.2020.00050
4. Csaky, R.: Deep Learning Based Chatbot Models. CoRR (2019), https://dblp.org
5. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018)
6. Shawar, B.A., Atwell, E.: Different measurement metrics to evaluate a chatbot system. In: Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies (2017)
7. Almansor, E.H., Hussain, F.K.: Survey on Intelligent Chatbots: State-of-the-Art and Future Research Directions. In: Complex, Intelligent, and Software Intensive Systems. Springer International Publishing (2019)
8. Larson, S., Mahendran, A., Peper, J.J., Clarke, C., Lee, A., Hill, P., Kummerfeld, J.K., Leach, K., Laurenzano, M.A., Tang, L., Mars, J.: An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction (2019), https://dblp.org/rec/journals/corr/abs-1909-02027.bib