Towards a theoretical formalization of conversational recommendation

Tommaso Di Noia¹, Francesco Maria Donini², Dietmar Jannach³, Fedelucio Narducci¹ and Claudio Pomo¹

¹ Politecnico di Bari, via Orabona, 4, 70125 Bari, Italy
² Università degli Studi della Tuscia, via Santa Maria in Gradi, 4, 01100 Viterbo, Italy
³ University of Klagenfurt, Universitätsstraße 65-67, 9020 Klagenfurt am Wörthersee, Austria

MICROS@CIKM'22: Proceedings of the CIKM Workshop on Mixed-Initiative ConveRsatiOnal Systems, October 21, 2022, Atlanta, USA

Abstract
Tools that interact vocally with users are becoming increasingly popular in the market, boosting the interest of industry and academia in them. In such environments, conversational recommender systems succeed in guiding users in situations of information overload. Through multiple interactions with users, such systems ask questions, filter the catalog in a personalized manner, and suggest items that are of potential interest to the consumer. In this context, conversational efficiency in terms of the number of required interactions often plays a fundamental role. This work introduces a theoretical and domain-independent approach to support the efficiency analysis of a conversational recommendation engine. Observations from an empirical analysis align with our theoretical findings.

1. Introduction and Motivation

System-generated recommendations have become a common feature of modern online services such as e-commerce sites, media streaming platforms, and social networks. In many cases, the suggestions made by the underlying recommender systems are personalized according to the user's tastes, needs, and preferences. In the most prominent applications of recommender systems, user preferences are estimated based on past user behaviors. However, there are several application domains where no past interaction logs are available or where the user's needs and preferences might differ each time the user interacts with the service (e.g., restaurant recommendation for a party or a romantic dinner). In such application settings, a multi-turn, interactive recommendation process is required, where the system's goal is to learn about the user's preferences to the extent that appropriate recommendations can be made. Conversational Recommender Systems (CRS) support such processes, and these systems have received increased attention in recent years [1, 2, 3, 4, 5, 6, 7].

The preference elicitation process in such settings can be implemented in different ways, ranging from predefined fill-out forms to natural language interfaces—see Jannach et al. [5] for an overview. In that context, a specific goal when designing a CRS is to minimize the effort for users by asking as few questions as possible, i.e., to increase the efficiency of the dialog [8, 9, 10, 11, 12, 13, 14].

Today, research in the general area of recommender systems, and specifically in the area of CRS, is almost entirely empirical [15, 16, 17]. Such empirical studies are certainly important and insightful. However, little is known about the theoretical aspects of the underlying interactive recommendation processes. Unfortunately, theoretical questions regarding, e.g., the computational complexity of determining a good or the best interaction strategy cannot be answered without a formal characterization of the overall problem.

With this work, we address this research gap and provide a theoretical model of conversational recommendation. The model is designed in a domain-independent way and aims to cover a wide range of realistic application scenarios. A conversational recommendation process is modeled as a sequence of states, where state transitions correspond to common user intents and conversational moves [18, 19, 20] that can be found in the literature. Since our model is agnostic about the application domain and the algorithm that is used to select and rank the objects for recommendation (i.e., the recommendation algorithm), it serves as a basis to analyze important theoretical properties of CRS.

The main contribution of this work¹ is the study of the computational complexity of finding an efficient conversational strategy in terms of the number of dialog turns. In particular, we demonstrate that: (i) the problem of finding an efficient conversational strategy in terms of the number of dialog turns is NP-hard, but in PSPACE; (ii) some specific factors of the item catalog influence the complexity of the problem; (iii) for a special class of catalogs, the upper bound lowers to POLYLOGSPACE. From a practical perspective, our analysis leads to the observation that the efficiency of a conversation strategy is tied to the characteristics of the catalog. Observations from an empirical analysis on datasets based on MovieLens-1M support these theoretical considerations.

¹ An extended version of this work is available in Di Noia et al. [21].

2. Model Description

In our theoretical framework we assume a retrieval-based item filtering approach, which is commonly used in critiquing-based and constraint-based approaches to recommendation [22, 23]. In critiquing approaches, users are presented with a recommendation early in the dialogue and can then apply pre-defined critiques on the recommendations, e.g., "less $$" [5]. In analogy to database-oriented approaches, we therefore use the term "query" when referring to positive user preferences. Negative preferences are modeled as constraints on disliked item features. The retrieved items are then ranked according to any type of information, e.g., the popularity of certain items. In order to carry out a general analysis, in our approach we abstract from the details of this ranking.

To model the conversational recommendation process, we rely on the notion of the state of a conversation, and on the transformations this state can be subject to, depending on the interaction. For example, each preference expressed by the user leads to a change of the state of the conversation and may also imply a change in the set of recommendable items. This formalization through conversation states ultimately serves as a basis to study the efficiency of conversational strategies. The most efficient conversational strategies will minimize the number of states that must be passed through to reach an end.

In our model, we mainly deal with the system-driven part of a conversation, where a conversation consists of a sequence of interactions.² The system can perform one of the following actions:³

1. ask the user to fill in (provide) a value for a particular feature that is under-specified so far (e.g., the item color);
2. ask the user to enlarge a too narrow choice for a feature value (e.g., to change the price limit);
3. ask for changing a feature value (e.g., from green to red for the color feature);
4. upon rejection of one or more items, ask the user whether there is a specific feature value of these items she dislikes (e.g., a specific color).

The user can react to the above system prompts with one of the following interactions:

(a) given one or more recommendations, the user can accept one of them, or reject them;
(b) the user can state that she dislikes every item where a feature is filled with a particular value (e.g., "I don't like green cellphones").

The formalization of the interaction process described by Di Noia et al. [21] allows us to establish the results mentioned above.

² In constraint-based and critiquing-based systems the recommender system usually drives the conversation in an initial preference elicitation phase. In typical implementations of such systems, the user can however also take the initiative and, for example, request recommendations at any time or proactively revise their preferences.
³ We note that in the area of Knowledge Representation, slot unfilling (e.g., retracting any requirement about colors) could be considered a special case of knowledge contraction [24], while slot change (e.g., from green to red) is a form of revision [25].

3. Empirical Analysis

3.1. Experimental Design

One main theoretical result is that the chosen conversation strategy (protocol) not only impacts the efficiency of a CRS, but that the efficiency also depends on the characteristics of the item catalog, e.g., in terms of the number of available item features and the number of distinct values. We devised an in-vitro (offline) experiment using two protocols to empirically validate this result.

Protocols. A CRS may support two different ways (protocols) of how users can reject a recommendation made by the system:

P1 - the user rejects the recommendation and the CRS does not ask the user to provide a specific reason, i.e., a reason that refers to a disliked feature value. Examples of such more unspecific feedback—if any feedback is given at all—could be, "I don't want to go to the Green Smoke restaurant" or "I don't want to see the movie American Beauty" (for some reason, but I cannot explain this to a system);

P2 - the user rejects the recommendation and the CRS asks for a specific item characteristic (i.e., feature value) she does not like at all, for example, the green color for cellphones, sea-view restaurants, a particular movie director, etc. We assume that a user will truthfully answer such questions.
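As an illustration (this sketch is ours and not part of the paper's formal model; all identifiers are hypothetical), the two rejection protocols can be rendered as catalog-update rules over items represented as feature-value dictionaries:

```python
def reject_p1(catalog, rejected):
    """P1: no reason is given; only the rejected items are removed."""
    return [item for item in catalog if item not in rejected]

def reject_p2(catalog, rejected, feature, value):
    """P2: the user names a disliked feature value; every item
    carrying that value is removed as well."""
    return [item for item in catalog
            if item not in rejected and item.get(feature) != value]

phones = [
    {"color": "green", "brand": "A"},
    {"color": "green", "brand": "B"},
    {"color": "red", "brand": "A"},
]
# P1 removes only the shown item; P2 ("I don't like green cellphones")
# also removes every other green item in the same step.
after_p1 = reject_p1(phones, [phones[0]])                    # two items left
after_p2 = reject_p2(phones, [phones[0]], "color", "green")  # one item left
```

A single P2 answer can thus prune many items at once, which is the mechanism behind the intuition that catalogs with many few-valued features benefit most from feature-based critiques.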
Hypotheses. Based on our theoretical results, we formulate two hypotheses, where the difference lies in the characteristics of the item catalog.

H1 We do not expect a strong difference in terms of efficiency between P1 and P2 when the items in the catalog have few features with a large number of distinct values.

H2 We do expect a strong difference in terms of efficiency between P1 and P2 when the items in the catalog have several features with few distinct values.

Experiment Specifics. In our experiment, we simulate the above-mentioned protocols and vary the underlying item catalog as an independent variable.

Efficiency Metric. As commonly done in the literature [14, 17], we use the number of questions (NQ) the CRS asks before the user accepts a recommendation as an efficiency measure. Fewer questions indicate higher interaction efficiency.

Dataset and Catalog Description. We rely on the widely used MovieLens-1M (ML-1M) dataset for our experiment, which we enrich with item features using DBpedia [26]. The resulting dataset comprises 3,308 items with 279 unique features. From this dataset, we create two versions to test our hypotheses.

• Itemset1 (IS1) has only a few features, but with a larger number of distinct values. It is designed to support H1 (we do not expect a strong difference in terms of efficiency between P1 and P2 when the items in the catalog have few features with a large number of distinct values).

• Itemset2 (IS2), in contrast, has a larger number of features, but each of them only has a few distinct values; it is designed to support H2 (we expect a strong difference in terms of efficiency between P1 and P2 when the items in the catalog have several features with few distinct values).

Specifically, to make the datasets sufficiently different and such that they reflect the characteristics described in our research hypotheses, IS1 and IS2 have 4 and 10 features per item, respectively. For each feature of IS1 there are from 1,500 to 2,500 distinct values, whereas there are about 100 values on average for IS2. To better focus on the main goals of our experiment, we make the simplifying assumption that all features can have only one value. Accordingly, we replaced set-valued features (e.g., the movie cast) with a single value randomly chosen among them.

Simulation Procedure. We simulate the part of a conversation between a user and a CRS where the system drives the interaction by asking the user about preferred item features and making recommendations for items.⁴ We assume that the simulated user has certain pre-existing preferences regarding item features and truthfully responds to system questions about these preferences. When provided with a recommendation, the user either rejects it, which means that the dialog continues, or accepts it, and the dialog ends. The CRS in our simulation implements one of the described conversation strategies, P1 or P2.

Note that in our experiment we simulate a user cold-start situation, i.e., we are not taking any long-term user profile into account during the dialog. In order to simulate the response of a user, we first select a set of positively-rated items (PRI) for each user. This set consists of those items in the dataset that the user has rated with a value that is greater than or equal to their average rating in the MovieLens dataset. We use this set PRI for two purposes. First, we simulate a dialog for each element 𝐼 of PRI as an "ideal" item (that the user will accept). Second, we use the items in PRI to determine the pre-existing preferences of a user and simulate their answers to the questions posed by the system. Therefore, if the user previously liked action and romantic movies, the set of pre-existing preferences contains only these values.

When the simulated dialog with a defined ideal item 𝐼 starts, the system will ask a question on a feature, e.g., "What is your favorite genre?". The simulated user will then respond by choosing a value from the set of values for that feature occurring in PRI. In our simulation, the user cannot answer with a value that is not present in any recommendable object. After each user answer, the set of recommendable items 𝒮 is updated by the CRS according to her answer. A recommendation is shown when the system has no more questions to ask. This situation may occur when: (i) preferences on all the features have been expressed, or (ii) only one item in the catalog is consistent with the user preferences. The user rejects the recommendation when 𝐼 is not present in the list of recommended items. If the recommendation is rejected, the recommended items are removed from the catalog and the system starts posing questions to the user again. It is noteworthy that, since we may have more than one item in the catalog described by the feature values selected by the user during the dialog, the final recommendation may also contain a set of items.

In case of rejection, protocols P1 and P2 lead to different system reactions. In the case of P1, the system starts querying the user again, beginning with a first randomly selected feature; the values selected during the previous interactions are discarded. In protocol P2, in contrast, the user declares one of the feature values of the recommended items as disliked.

When the recommendation succeeds—i.e., when the ideal item 𝐼 is in the list of recommendations—the dialog is successfully ended, and the simulation continues with a new dialog for another user and/or target item. The simulation ends when a dialog has been simulated for each element in PRI for every user.

⁴ Other parts of the conversations may include greetings or chit-chat. For a catalog of common user intents in CRS, see [18].
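The simulation procedure above can be condensed into a small loop that counts NQ for a single dialog. The sketch below is our own simplified rendering, not the authors' implementation: all names are hypothetical, and the unspecified question-selection and answer policies are stubbed with random choices.

```python
import random

def simulate_dialog(catalog, features, prefs, ideal, protocol):
    """Count the questions (NQ) asked before the simulated user accepts
    a recommendation containing the ideal item. Items are feature->value
    dicts; prefs maps each feature to the user's liked values (from PRI)."""
    items = list(catalog)                      # current recommendable set S
    nq = 0
    while True:
        remaining = list(features)
        shown = list(items)
        # ask questions until no features are left or only one item remains
        while remaining and len(shown) > 1:
            f = remaining.pop(random.randrange(len(remaining)))
            nq += 1
            # the user answers with a liked value that still matches
            # at least one recommendable item
            options = [v for v in prefs[f] if any(it[f] == v for it in shown)]
            if not options:                    # no liked value applies
                continue
            answer = random.choice(options)
            shown = [it for it in shown if it[f] == answer]
        if ideal in shown:
            return nq                          # recommendation accepted
        if protocol == "P1":
            # no reason given: drop only the shown items, then restart
            # questioning with previously selected values discarded
            items = [it for it in items if it not in shown]
        else:
            # P2: the user names a disliked feature value of a shown item
            f = next(f for f in features
                     if any(it[f] != ideal[f] for it in shown))
            bad = next(it[f] for it in shown if it[f] != ideal[f])
            items = [it for it in items if it[f] != bad]

# Toy catalog (hypothetical data): every feature combination exists once.
features = ["genre", "director", "decade"]
catalog = [{"genre": g, "director": d, "decade": y}
           for g in ("action", "drama")
           for d in ("X", "Y")
           for y in ("1990s", "2000s")]
ideal = catalog[0]
prefs = {"genre": {"action", "drama"}, "director": {"X", "Y"},
         "decade": {"1990s", "2000s"}}
nq_p1 = simulate_dialog(catalog, features, prefs, ideal, "P1")
nq_p2 = simulate_dialog(catalog, features, prefs, ideal, "P2")
```

Averaging such runs over all ideal items of all users yields NQ statistics of the kind reported in Section 3.2: each P2 rejection removes a whole slice of the catalog, while each P1 rejection removes only the items just shown.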
Figure 1: (a) the average number of Questions per Configuration; (b) the maximum number of Questions per Configuration. (Bar chart; recovered values: average NQ for IS1: P1 72.64, P2 67.45; for IS2: P1 1,023.75, P2 166.34; maximum NQ for IS1: P1 1,588, P2 524; for IS2: P1 6,671, P2 1,152.)

3.2. Results and Discussion

We applied protocols P1 and P2 both for itemset IS1 and IS2, and we counted the Number of Questions (NQ) required to reach the test item in each configuration. Figure 1 summarizes the results. As expected from the theoretical analysis, with IS1 we observe minor differences between the two protocols in terms of the number of required questions asked by the system. More specifically, P1 needs 72.64 questions on average with IS1, whereas P2 needs 67.45. The difference is however huge for IS2, where P1 needs more than 1,000 questions on average to reach the test item, while P2 requires around 166 questions (Fig. 1a). Hence, we can confirm that when the items in the catalog have many features with a smaller number of distinct values, the efficiency of P2 grows drastically compared to P1. Also the maximum number of questions confirms this difference in efficiency for IS1 and IS2 (Fig. 1b).

We note that NQ in absolute terms is very large—even in the best combination (P2 with IS1), the number of questions is close to 70, which sounds too high for practical applications. Recall, however, that in this experiment we implemented a worst-case scenario, and our experiment used an unrealistic setting on purpose. In our scenario the recommendation task is deliberately difficult:

• for each dialog, there is only one test item (true positive);
• the CRS works in a cold-start condition without any user profile;
• the CRS does not implement a cut-off on the number of questions to ask the user.

In conclusion, our experimental evaluation results align with our theoretical findings, thus providing support for our research hypotheses H1 and H2. In other words, our simulation confirmed what was foreseen by the theoretical analysis: the difference between protocol P1 and protocol P2 shows up clearly only in a dataset with many features with a small set of different values.

4. Summary and Outlook

With this work, we contribute to a better understanding of the theoretical properties of conversational recommendation problems, and we specifically address questions related to the computational complexity of finding efficient dialog strategies. One main insight of our theoretical analysis—which was also confirmed by an in-vitro experiment—is that when designing an efficient conversation strategy, we must always consider the characteristics of the item catalog. More specifically, we demonstrated that when a few features with a large number of distinct values characterize the items in the catalog, the critiquing strategy based on asking the user about a disliked characteristic of the recommended item does not give any significant advantage in terms of user effort. Conversely, when the catalog is composed of items with several features with a few distinct values, a critiquing strategy based on item features can drastically reduce the user effort for reaching a liked recommendation. On a more general level, we hope that our work might help to stimulate more theory-oriented research in this area, leading to a better understanding of the foundational properties of this essential class of interactive AI-based systems. In future research, we will investigate the explicit consideration of individual long-term user preferences in the interactive recommendation process.

Acknowledgement

The authors acknowledge partial support from the projects PASSEPARTOUT, ServiziLocali2.0, Smart Rights Management Platform, BIO-D, and ERP4.0.

References

[1] K. Christakopoulou, F. Radlinski, K. Hofmann, Towards conversational recommender systems, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 815–824.
[2] T. Shen, Z. Mai, G. Wu, S. Sanner, Distributional contrastive embedding for clarification-based conversational critiquing, in: Proceedings WWW '22, 2022, pp. 2422–2432.
[3] K. Zhou, W. X. Zhao, S. Bian, Y. Zhou, J.-R. Wen, J. Yu, Improving conversational recommender systems via knowledge graph based semantic fusion, in: Proceedings KDD '20, 2020, pp. 1006–1014.
[4] C. Gao, W. Lei, X. He, M. de Rijke, T.-S. Chua, Advances and challenges in conversational recommender systems: A survey, AI Open 2 (2021) 100–126.
[5] D. Jannach, A. Manzoor, W. Cai, L. Chen, A survey on conversational recommender systems, ACM Comput. Surv. 54 (2021) 1–26.
[6] W. Cai, Y. Jin, L. Chen, Impacts of personal characteristics on user trust in conversational recommender systems, in: Proceedings CHI '22, 2022, pp. 489:1–489:14.
[7] F. Radlinski, K. Balog, F. Diaz, L. Dixon, B. Wedin, On natural language user profiles for transparent and scrutable recommendation, in: Proceedings SIGIR '22, 2022, pp. 2863–2874.
[8] G. Adomavicius, Y. Kwon, New recommendation techniques for multicriteria rating systems, IEEE Intelligent Systems 22 (2007) 48–55. doi:10.1109/MIS.2007.58.
[9] P. Viappiani, C. Boutilier, Regret-based optimal recommendation sets in conversational recommender systems, in: RecSys '09, 2009, pp. 101–108.
[10] T. Mahmood, F. Ricci, Improving recommender systems with adaptive conversational strategies, in: Proceedings of the 20th ACM Conference on Hypertext and Hypermedia, 2009, pp. 73–82.
[11] J. Reilly, J. Zhang, L. McGinty, P. Pu, B. Smyth, A comparison of two compound critiquing systems, in: IUI '07, 2007, pp. 317–320.
[12] P. Grasch, A. Felfernig, F. Reinfrank, ReComment: Towards critiquing-based recommendation with speech interaction, in: Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, 2013, pp. 157–164.
[13] F. Narducci, M. de Gemmis, P. Lops, G. Semeraro, Improving the user experience with a conversational recommender system, in: International Conference of the Italian Association for Artificial Intelligence, Springer, 2018, pp. 528–538.
[14] A. Iovine, F. Narducci, G. Semeraro, Conversational recommender systems and natural language: A study through the ConveRSE framework, Decision Support Systems 131 (2020) 113250.
[15] A. Iovine, P. Lops, F. Narducci, M. de Gemmis, G. Semeraro, An empirical evaluation of active learning strategies for profile elicitation in a conversational recommender system, Journal of Intelligent Information Systems (2021) 1–26.
[16] V. Bellini, G. M. Biancofiore, T. Di Noia, E. Di Sciascio, F. Narducci, C. Pomo, GUapp: A conversational agent for job recommendation for the Italian public administration, in: EAIS, 2020, pp. 1–7.
[17] D. Jannach, Evaluating conversational recommender systems, Artificial Intelligence Review (2022). doi:10.1007/s10462-022-10229-x.
[18] W. Cai, L. Chen, Predicting user intents and satisfaction with dialogue-based conversational recommendations, in: UMAP '20, 2020, pp. 33–42.
[19] R. Reichman, Getting Computers to Talk Like You and Me, MIT Press, Cambridge, Massachusetts, 1985.
[20] D. Jannach, Finding preferred query relaxations in content-based recommenders, in: Intelligent Techniques and Tools for Novel System Architectures, Springer, 2008, pp. 81–97.
[21] T. Di Noia, F. M. Donini, D. Jannach, F. Narducci, C. Pomo, Conversational recommendation: Theoretical model and complexity analysis, Information Sciences, in press (2022). doi:10.1016/j.ins.2022.07.169.
[22] A. Felfernig, G. Friedrich, D. Jannach, M. Zanker, Constraint-based recommender systems, in: Recommender Systems Handbook, Springer, 2015, pp. 161–190.
[23] D. Jannach, Advisor Suite – A knowledge-based sales advisory system, in: ECAI '04, 2004, pp. 720–724.
[24] S. Colucci, T. Di Noia, E. Di Sciascio, F. M. Donini, M. Mongiello, Concept abduction and contraction for semantic-based discovery of matches and negotiation spaces in an e-marketplace, Electron. Commer. Res. Appl. 4 (2005) 345–361.
[25] P. Gärdenfors, Belief revision and knowledge representation, in: Proceedings of the Sixth Conference on Theoretical Aspects of Rationality and Knowledge, 1996, p. 117.
[26] T. Di Noia, V. C. Ostuni, P. Tomeo, E. Di Sciascio, SPrank: Semantic path-based ranking for top-N recommendations using linked open data, ACM Trans. Intell. Syst. Technol. 8 (2016) 9:1–9:34.