Taking decisions in a Hybrid Conversational AI architecture using Influence Diagrams

Roberto Basile Giannini1,2,†, Antonio Origlia1,2,∗,† and Maria Di Maro1,2,†

1 Dept. of Electrical Engineering and Information Technology, University of Naples Federico II
2 Urban/ECO Research Center, University of Naples Federico II

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
∗ Corresponding author.
† These authors contributed equally.
ro.basilegiannini@studenti.unina.it (R. Basile Giannini); antonio.origlia@unina.it (A. Origlia); maria.dimaro2@unina.it (M. Di Maro)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Abstract
This paper explores the application of the Influence Diagrams model for decision-making in the context of conversational agents. The system consists of a Conversational Recommender System (CoRS), in which the decision-making module is separate from the language generation module. It provides the capability to evolve a belief based on user responses, which in turn influences the decisions made by the conversational agent. The proposed system is based on a pre-existing CoRS that relies on Bayesian Networks informing a separate decision process. The introduction of Influence Diagrams aims to integrate both Bayesian inference and the dialogue move selection phase into a single model, thereby generalising the decision-making process. To test the effectiveness and plausibility of the dialogues generated by the developed CoRS, a dialogue simulator was created and the simulated interactions were evaluated by a pool of human judges.

Keywords
Conversational AI, Decision-making, Influence Diagrams

1. Introduction

In recent years, the success of neural networks has generated significant enthusiasm among professionals in the field of artificial intelligence as well as the general public. Various applications, such as speech recognition, computer vision and even interactive conversational models like ChatGPT, have increasingly engaged users, inevitably shaping their perception of AI. This perception can have various implications, even within the scientific community. Attributing human-level intelligence to the tasks currently accomplished by neural networks is questionable, as these tasks barely rise to the level of abilities possessed by many animals [1]. Neural-based approaches to artificial intelligence have been criticised because of the limitations that are intrinsic to purely associative methods. One notable analysis of the problems that arise when considering linguistic material generated without a real understanding of the meaning of what is being said is found in [2], which highlights that, because of the way it is generated, content produced by GPT models adheres to at least one formal definition of bullshit. The fundamental problem with these models is that, while they are trained to capture surface aspects of communication, they are never exposed to the reasons why language is produced. When they output the most probable continuation of the provided prompt, they leave entirely to the human reader the task of interpreting what the produced output might have meant.

From a linguistics point of view, within the framework of Austin's speech act theory [3], "saying something" equals "doing something"; the act of producing a sentence (locutionary act) is fuelled by an intention (illocutionary act) that produces changes in the world (perlocutionary act). This classic view of the act of speaking highlights that conversation is a form of intervention in the world: it is put in action to alter in some way the conversational context. This same position is also found in the recent literature about the role of causality in artificial intelligence. Judea Pearl's Ladder of Causation [4] puts intervention capabilities on the second level of the ladder, characterised by the verb doing, as in Austin's seminal work. In Pearl's account, machine learning capabilities are limited to the first step of the Ladder, concerned with observational capabilities, leaving interventional ones out.

From this perspective, a conversational agent that produces language motivated by the achievement of a goal, thus modelling a raison d'exprimer, is an agent capable of using language with interventional purposes, which can be placed on the second step of the Ladder of Causation. A tool that aims to define conversational agents according to this philosophy is the Framework for Advanced Natural Tools and Applications with Social Interactive Agents (FANTASIA) [5], an Unreal Engine¹ plugin designed to develop embodied conversational agents. Built upon the functionalities offered by the tool, the FANTASIA Interaction Model follows these main principles: Behaviour Trees (BT) [6] are used to organise and prioritise dialogue moves; Graph Databases (i.e., Neo4j [7]) are used for knowledge representation and dialogue state tracking; Probabilistic Graphical Models (PGM) are used for decision making; LLMs are used to verbalise the decisions taken by PGMs.

¹ https://www.unrealengine.com/

The latest results obtained using FANTASIA, presented in [8, 9], used a decision system based on Bayesian Networks (BN) to estimate probability distributions over ratings for users of a movie recommender system. The decision about the dialogue move was taken by a rule-based system taking into account these estimates. In this work, we further develop the approach by generalising the decision process using a single model, an Influence Diagram (ID) [10]. IDs represent an extension of BNs [11] since, in addition to probabilistic nodes, they also contain:

• Decision nodes, which represent decision points for the agent and which may be multiple within the model.
• Utility nodes, which represent utility (or cost) factors and which will drive the agent's decisions, since the objective will be to maximise the utility of the model.

Consequently, in addition to the modelling of probabilistic inference problems, the use of IDs also enables the modelling and solving of decision-making problems, in accordance with the criterion of maximum expected utility. In this way, the ID encapsulates both the Bayesian inference and the decision phase in a single, more flexible and elegant model.

2. Original system

The original system on which the proposed system is based was presented in [8, 9]. This system is a CoRS with argumentative capabilities based on linguistic and cognitive principles. From a design point of view, the original system followed the FANTASIA Interaction Model and the PGM of choice was the Bayesian Network, implemented using the aGRuM library [12].

From the knowledge representation point of view, a graph database is adopted to host information derived from Linked Open Data (LOD) sources. For the purposes of this case study, the movies domain will be considered. The knowledge base is constructed by collecting data from different sources and enriched using graph data science techniques, which are employed to capture latent information. The procedure is described in [13]. The main entities of the knowledge base are represented by the labels MOVIE, PERSON and GENRE, which are interconnected by appropriate relationships (such as HAS_GENRE, WORKED_IN, and so on). Additionally, information from the MovieLens 25M² dataset is integrated into the database. A MOVIELENSUSER node is created for each user in the dataset, and a RATED relationship is established between the MOVIELENSUSER node and a MOVIE node for each reported rating in the dataset. In addition to a number of basic properties such as name, year of birth and ratings, MOVIE and PERSON nodes are characterised by authority and hub scores calculated by means of the HITS algorithm [14]. As discussed in [9], these network analysis measures help model cognitive characteristics that are relevant for the selection of plausible arguments [15]. Finally, the graph database is used to store a dialogue state graph which tracks the agent's relationships with the knowledge domain and other agents, including humans. This graph can be modified by the agent through speech acts in order to evolve it towards graph patterns that the agent identifies as goal patterns, i.e. a desired configuration of the dialogue state. In this way, the graph database is interrogated by the CoRS by extracting a relevant sub-graph, taking into account the knowledge base and the belief of the system evolved during the conversation.

² https://grouplens.org/datasets/movielens/

In the reference system, the decision-making level involves a BN dynamically generated on the basis of the extracted relevant sub-graph. In particular, in the case of Movie Recommendation, the actors, films and genres are the nodes of the BN, while the (oriented) relations between them represent the causal relations. Initially, each node is initialised by specifying its own conditional probability table (CPT), which can either be pre-calculated or derived from parent nodes. This network is used to adjust the exploitation/exploration cycle, typical of recommendation dialogue [16], by taking into account the data extracted from MovieLens (soft evidence) and the feedback gathered through the dialogue with the user (hard evidence). This way, after applying Bayesian inference, the BN can represent the probability of each movie and each feature being of interest for a user. Based on the information extracted from the Bayesian network, a module outside the PGM is responsible for the decisions taken. Specifically, the system decides whether to recommend a candidate item (exploitation move) or, in the case of non-recommendation, to ask the most useful question (exploration move), based on the criteria considered in [8]. In the case of exploitation moves, in addition to item recommendation, argumentation is provided based on the three most useful features, whose utility is calculated as the harmonic mean of four (normalised) parameters related to cognitive properties [15].

3. Proposed system

The proposed system based on IDs replicates part of the reference strategy: the aim of this work is to provide a first test of the capabilities of IDs to handle the problem, so we concentrate on the fundamental steps of the original strategy. Table 1 shows the characteristics of the reference system, highlighting the ones reproduced by the proposed system. The approach is inspired by the system presented in [17].

Table 1
The capabilities of the original system and the ones replicated in the new system using IDs (in bold).

Tracked beliefs   Question types   Question targets   Scores
Wants             Polar            Movie              Hub
Likes             Open             Actor              Authority
Knows                              Genre              Entropy

3.1. ID for Movie Recommendation

The current system is based on the previously introduced knowledge base and uses the same principles for the extraction of the relevant sub-graph. The decision-making core of the system is represented by the ID, again dynamically constructed from the relevant sub-graph. In particular, the construction of the ID is divided into two parts. The first part concerns the recommendation branch, along which a decision is made whether or not to make an exploitation move. For each movie, whether a candidate or a secondary film, an uncertainty node is generated, and the same is done for the individuals who are part of those films. In particular, the nodes related to films will be median nodes of the nodes related to the individuals who worked on that film. In addition, the query used to extract the relevant sub-graph returns a collection of votes assigned to films, which is used to apply soft evidence to each of the movie nodes (both target and secondary). For each candidate film, an EST(Movie) uncertainty node related to the estimator operating on that film is contextually generated. Indeed, within an ID it is possible to take into account the truth (the best movies in this case) and the estimate on the truth (the estimators on the best movies). Furthermore, the ID also takes into account the uncertainty of the estimator if the CPT of the EST nodes is initialised using the relative confusion matrix. As shown in Fig. 1, this information together will influence the system's decision on a potential exploitation move.

Figure 1: Generic ID structure related to the recommendation branch. BM nodes represent best movies, F nodes represent features and FM nodes represent feature movies, i.e. secondary movies. The topology follows the causal relationships that coexist between the entities involved.

Which decision will be made about the Recommendation node will depend on the utility function governing the goodness of possible choices. This function defines the utility value of not recommending, i.e. the utility of not performing an exploitation move:

\[ U_{NoRec} = U_{max} \cdot \frac{1}{1 + e^{nTurn - 5}} \in [0, U_{max}] \qquad (1) \]

where $U_{max}$ represents the maximum utility that can be given to a choice, while the second contribution is given by a sigmoid that takes as input $nTurn$, the number of questions asked by the system. The objective is to have a utility of not recommending that is maximum at the beginning of the dialogue and that decreases as the number of questions asked increases, with an increasing rate of decrease. In this way, the system will be inclined to always ask the user at least one question and never to exceed a certain number of questions. Thus, $U_{NoRec}$ represents the system's indecision with regard to the possible recommendations it can give at that moment, an indecision that is expected to be greatest at the beginning of the dialogue since the system does not yet have any information about the user. In addition to the utility of not recommending, the function defines the utility of recommending a particular candidate movie $m$:

\[ U_m = (2 U_{max} \cdot r_m^2) - U_{max} \in [-U_{max}, U_{max}] \qquad (2) \]

where $r_m$ represents the rating assigned to the candidate movie $m$, normalised between 0 and 1. In this way, the utility of the recommendation will be linked to the true rating of the candidate movie and its value will be negative for low ratings and positive for high ratings. Thus, recommending an item with a low true rating will be punitive compared to an item with a high true rating. The objective is to prioritise the recommendation of movie candidates with higher true ratings and to disfavour the recommendation of those with low true ratings, possibly by preferring an exploration move.

The second part of the process concerns the exploration branch, during which an exploration move is made. The underlying assumption is that if the utility of "not recommending" is greater than that of recommending a movie, then a question must be asked. In particular, the most useful question must be chosen. In this case study, as anticipated, the exploration only involves the entropy of the features, not taking into account other aspects of the features and other nodes. In particular, for each feature f extracted from the database, an uncertainty node H(f) is contextually created. Each node H represents the entropy of the related feature. A decision node What question is in charge of deciding which question should be asked, and depends on both the nodes H and the decision node Recommendation, generating a decision sequence starting from the latter. The idea is that the choice of question must depend both on the entropy of the features that can be chosen and on the decision that was made at the time of the recommendation, i.e. the decision to perform an exploitation or an exploration move. Among the possible choices of What question, in fact, there is also a No question, which only makes sense to choose in the case of an exploitation move. Finally, a Cost utility function represents the utility of the What question choice. Fig. 2 shows the structure of the exploration branch in a generic form.

Figure 2: Generic ID structure related to the exploration branch. For each feature (actor, director, etc.), an uncertainty node H is generated, representing its entropy. These nodes, together with the previous decision to recommend or not to recommend, condition the choice of question to ask, which has a cost.

Tab. 2 shows the cost associated with each decision sequence that the system is capable of undertaking. In particular, the highest cost, equal to −100, is applied to those decision sequences that are to be avoided. Conversely, the lowest cost, equal to 0, is applied to the case where the system recommends a movie and asks no question. A variable cost, between −100 and −1, is calculated in the case where the system decides not to recommend and to ask a question about an actor. The magnitude of this cost will depend on the entropy value of the relevant uncertainty node. The higher the entropy, the lower the cost of the corresponding question. The idea is to collect evidence on the uncertainty nodes on which the model's uncertainty is most concentrated, as the system's objective is to lower the model's entropy level before making a recommendation.

Table 2
The cost function the system considers when deciding to ask questions.

Recommendation   What question   Cost
Movie            Actor           −100
Movie            No question     0
No               Actor           −99 ⋅ (1 − H(f)) − 1
No               No question     −100

3.2. Simulation

The current system was tested by simulating a dialogue between the system and a MovieLens user whose answers are derived from ratings recorded in the dataset. At the beginning of the conversation, the agent has no information about the user and for this reason the user immediately specifies the preferred genre. This information is derived by searching the database for the genre for which the average rating of that particular user is the highest. All following questions are polar questions and concern PERSON type features. Again, the answer is derived by considering the ratings given by the user to the ARGITEMs associated with that feature. Once the genre is known, a positive belief likes is created that associates the user with the preferred genre and at this point the database is queried by extracting the best three candidate films, the related features and the secondary films. If, according to the ID, the best action is to recommend, the system proposes one of the candidate films to the user; otherwise, if the best action is not to recommend, the system asks the most useful question. If the user's answer consists of a positive or negative preference, this involves adding evidence in the system, adding the user's stance on that feature to the database and reconstructing the ID from a dataframe extracted with the same query used at the beginning of the dialogue. The idea is that by keeping track of the user's stances collected as the system asks questions, it is possible to extract target movies that are more consistent with the user's preferences. When a film is recommended, the system also provides arguments to support its choice, consisting of a selection of the most important features related to the recommended film, thus implementing Argumentation-based dialogue [18]. The dialogue provided by the simulation is constructed using templates, which causes the generated conversation to sound unnatural. For this reason, these template-based dialogues were reformulated by ChatGPT-4 to make the conversation more natural, using the following prompt: "Rephrase the following dialogue to make it sound more natural. Keep the structure and only change the sentences."

4. Experimental setup

The experimental phase followed the approach used in [8]. The approach involves recruiting 20 participants via the Prolific³ portal, who were asked to complete a survey on the Qualtrics⁴ platform that involves the evaluation of 20 dialogues divided into three types:

• Five dialogues taken from INSPIRED Corpus [19], a dataset of human-human interactions for Movie Recommendation. These dialogues represent the positive subset of the control group.
• Five system-generated dialogues where both the extraction of candidate films and the choice of supporting features are random, independent of system belief. These dialogues represent the negative subset of the control group.
• Ten dialogues generated by the system using the presented strategy, which represent the target dialogues.

³ https://www.prolific.com/
⁴ https://www.qualtrics.com/

Fig. 3 shows the survey task with the four questions asked to the participant for each dialogue, for which the participant gives a score between 1 and 5. Q1 refers to the consistency of the questions asked during the exploration move, in order to understand whether the features are selected correctly during the dialogue. Q2 and Q3 refer to the naturalness of the dialogue, with the latter referring to the user's perception of the recommender's level of expertise. Finally, Q4 refers to the quality of the features chosen to support the recommendation. All participants were native English speakers living in the UK or US and they were compensated according to the average hourly wage of their home country.

Figure 3: Survey task and questions posed to participants for each dialogue.

5. Results

Fig. 4 shows the scores obtained for each question by the current system based on ID (b), compared with the scores obtained by the original system based on BN (a). In both instances, the scores obtained by the target dialogues are higher than those obtained by the negative dialogues and lower than those obtained by the positive dialogues. In particular, the difference between target and negative dialogues is more pronounced on Q4, which is an indicator that the supporting arguments make the recommendation plausible.

Figure 4: Comparison between the results obtained by the original system based on BN (a) and by the proposed system based on ID (b). The obtained results show higher scores than the baseline represented by the negative dialogues but not as high as the ones obtained by the original system. The difference between the two systems is expected as only part of the original strategy is replicated in this work, excluding a series of significant aspects, such as asking open questions and discussing films as well as the people who work in them.

As an objective measure, during the generation of the dialogues, the average normalised entropy of the ID was recorded at each round, calculated as the average of the normalised entropy among all variable nodes of the model. In Fig. 5 it can be observed that (a) during a target dialogue the average entropy of the model decreases, in contrast to the case where (b) the dialogue is random and the average entropy of the model does not tend to decrease. The first scenario is compatible with the idea that the system accumulates information as the dialogue progresses, in accordance with the strategy adopted. In the second scenario, on the other hand, the ID is regenerated at each turn from randomly extracted candidate films, making it unlikely that the newly extracted features contribute to accumulating coherent information.

Figure 5: Trend of normalised mean entropy of the ID during (a) target dialogues and (b) random dialogues. These trends were obtained by measuring the entropy of the system during the generation of ten target dialogues in (a) and ten random dialogues in (b).

To further analyse the data concerning the synthetic dialogues, we use a Cumulative Link Mixed Model (CLMM) [20] with Laplace approximation [21]. This model accommodates random effects attributable to individual participants or specific stimuli, treating them as blocking variables, and assesses the likelihood of observing high values on the Likert score in relation to the independent variable (i.e., dialogue type). The test revealed that the association between dialogue type and the occurrence of high scores is, in general, very strong (𝑝 < 0.001) for both target and positive dialogues and, as expected, absent for negative dialogues. This result is stronger with respect to the results obtained in [8, 9], where only a weak association was observed. There are multiple aspects that contribute to this result, in our opinion. First of all, in the original work, the 𝑝-value was already very close to the strong significance threshold (𝑝 = 0.0144), so the effect was only technically considered weak even in that case. Also, there is a chance that the simplified situation may have harmed negative dialogues more than the other two categories. As a final remark, however, the IDs have indeed made the decision process more uniform and flexible, given the introduction of utility functions and a unified framework for decision making. The quality improvement in the management of the decision process, especially in deciding when to recommend given the available arguments to support the position, has improved the system even in its basic form.

6. Conclusions & future work

The results obtained indicate that the implementation of a knowledge graph exploration strategy based on the ID is more effective than a random strategy. This conclusion is further supported by objective measures, including the system's entropy, which decreases as the system accumulates information during the dialogue before making a recommendation. It is therefore possible to generalise within an ID a decision-making process that, in the original system, was implemented by a module external to the probabilistic model. The results achieved in this case were lower than the ones of the original system, but this was expected as only part of the original strategy was replicated. Future work will cover the implementation of the missing functionalities and the deployment of the system in the Unreal Engine, as the technology to implement IDs has been integrated in the FANTASIA plugin. We will also investigate the possibility of integrating the argument selection process in the ID to fully support Argumentation Based Dialogue.

Acknowledgments

This work is supported by the Supporting Patients with Embodied Conversational Interfaces and Argumentative Language (SPECIAL) project, funded by the University of Naples on the "Fondi per la Ricerca di Ateneo" (FRA) program (CUP: E65F22000050001).

References

[1] A. Darwiche, Human-level intelligence or animal-like abilities?, Communications of the ACM 61 (2018) 56–67.
[2] M. T. Hicks, J. Humphries, J. Slater, Chatgpt is bullshit, Ethics and Information Technology 26 (2024) 38.
[3] J. L. Austin, How to Do Things with Words, Clarendon Press, 1962.
[4] J. Pearl, D. Mackenzie, The book of Why, Basic Books, 2018.
[5] A. Origlia, F. Cutugno, A. Rodà, P. Cosi, C. Zmarich, Fantasia: a framework for advanced natural tools and applications in social, interactive approaches, Multimedia Tools and Applications 78 (2019) 13613–13648.
[6] G. Flórez-Puga, M. A. Gomez-Martin, P. P. Gomez-Martin, B. Diaz-Agudo, P. A. Gonzalez-Calero, Query-enabled behavior trees, IEEE Transactions on Computational Intelligence and AI in Games 1 (2009) 298–308.
[7] J. Webber, A programmatic introduction to neo4j, in: Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity, ACM, 2012, pp. 217–218.
[8] M. Di Bratto, A. Origlia, M. Di Maro, S. Mennella, Linguistics-based dialogue simulations to evaluate argumentative conversational recommender systems, User Modeling and User-Adapted Interaction (2024) 1–31.
[9] M. Di Bratto, A. Origlia, M. Di Maro, S. Mennella, On the use of plausible arguments in explainable conversational ai, in: Proceedings of Interspeech, 2024.
[10] R. A. Howard, J. E. Matheson, Influence diagrams, Decision Analysis 2 (2005) 127–143.
[11] A. Biedermann, F. Taroni, Bayesian Networks and Influence Diagrams, 2023, pp. 271–280.
[12] G. Ducamp, C. Gonzales, P. Wuillemin, agrum/pyagrum: a toolbox to build models and algorithms for probabilistic graphical models in python, in: Proceedings of the 10th International Conference on Probabilistic Graphical Models, volume 138, PMLR, 2020, pp. 609–612.
[13] A. Origlia, M. Di Bratto, M. Di Maro, S. Mennella, A multi-source graph representation of the movie domain for recommendation dialogues analysis, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 1297–1306.
[14] J. M. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM (JACM) 46 (1999) 604–632.
[15] F. Paglieri, C. Castelfranchi, Revising beliefs through arguments: Bridging the gap between argumentation and belief revision in mas (2005) 78–94.
[16] C. Gao, W. Lei, X. He, M. de Rijke, T.-S. Chua, Advances and challenges in conversational recommender systems: A survey, arXiv preprint arXiv:2101.09459 (2021).
[17] P. Lison, C. Kennington, Opendial: A toolkit for developing spoken dialogue systems with probabilistic rules, in: Proceedings of ACL-2016 system demonstrations, 2016, pp. 67–72.
[18] H. Prakken, Historical overview of formal argumentation, volume 1, College Publications, 2018.
[19] S. A. Hayati, D. Kang, Q. Zhu, W. Shi, Z. Yu, Inspired: Toward sociable recommendation dialog systems, in: 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Association for Computational Linguistics (ACL), 2020, pp. 8142–8152.
[20] A. Agresti, Categorical data analysis, volume 792, John Wiley & Sons, 2012.
[21] Z. Shun, P. McCullagh, Laplace approximation of high dimensional integrals, Journal of the Royal Statistical Society: Series B (Methodological) 57 (1995) 749–760.
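
As a supplementary illustration of Section 3.1, the following minimal Python sketch evaluates the utility of not recommending (Eq. 1), the utility of recommending a candidate (Eq. 2), the question cost of Table 2, and the normalised entropy of a discrete distribution. It is not the authors' implementation: the value U_MAX = 100, all function names, and the `choose_move` helper are assumptions introduced here. In particular, `choose_move` replaces the maximum expected utility computation performed by the ID with a direct comparison of point values, so it only mimics the shape of the decision under the assumption that ratings and entropies are known exactly.

```python
import math

U_MAX = 100.0  # assumed value; the text only states that a maximum utility exists


def utility_no_rec(n_turn, u_max=U_MAX):
    """Eq. (1): utility of not recommending after n_turn questions."""
    return u_max / (1.0 + math.exp(n_turn - 5))


def utility_rec(r_m, u_max=U_MAX):
    """Eq. (2): utility of recommending a movie with normalised rating r_m in [0, 1]."""
    return 2.0 * u_max * r_m ** 2 - u_max


def question_cost(recommend, question, entropy=0.0):
    """Table 2: cost of a (Recommendation, What question) decision sequence.

    recommend: True if a movie is recommended, False otherwise.
    question:  None for "No question", any other value for a question about an actor.
    entropy:   normalised entropy H(f) in [0, 1] of the feature asked about.
    """
    if recommend:
        return 0.0 if question is None else -100.0
    if question is None:
        return -100.0
    return -99.0 * (1.0 - entropy) - 1.0


def normalised_entropy(probs):
    """Shannon entropy of a discrete distribution divided by its maximum, log(n)."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)


def choose_move(n_turn, ratings, entropies):
    """Hypothetical simplification of the decision sequence using point values.

    ratings:   dict movie -> normalised rating in [0, 1]
    entropies: dict feature -> normalised entropy in [0, 1]
    Returns ("recommend", movie) or ("ask", feature).
    """
    best_movie = max(ratings, key=lambda m: utility_rec(ratings[m]))
    if utility_no_rec(n_turn) > utility_rec(ratings[best_movie]):
        # Exploration: pick the question with the least negative cost,
        # i.e. the feature with the highest entropy.
        best_feature = max(
            entropies, key=lambda f: question_cost(False, f, entropies[f])
        )
        return ("ask", best_feature)
    return ("recommend", best_movie)
```

With these assumed values, the sigmoid in Eq. (1) crosses half of U_MAX at the fifth question, so a sketch like this one asks questions early in the dialogue and switches to recommending once the utility of not recommending falls below that of the best candidate.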