Mixed-Modality Interaction in Conversational Recommender Systems

Yuan Ma, Timm Kleemann and Jürgen Ziegler
University of Duisburg-Essen, Duisburg, Germany

Abstract
Recent advances in natural language processing have made modern chatbots and Conversational Recommender Systems (CRS) increasingly intelligent, enabling them to handle more complex user inputs. Still, the interaction with a CRS is often tedious and error-prone. Especially when written text is used as the form of conversation, the interaction is often less efficient than conventional GUI-style interaction. To keep the flexibility and mixed-initiative style of language-based conversation while leveraging the efficiency and simplicity of interacting through graphical widgets, we investigate the design space of integrating GUI elements into text-based conversations. While simple response buttons have already been used in chatbots, the full range of such mixed-modality interactions has not yet been investigated in existing research. We propose two design dimensions along which integrations can be defined and analyze their applicability for preference elicitation and for critiquing the CRS's responses at different levels. We report a user study in which we investigated user preferences and perceived usability of different techniques based on video prototypes.

Keywords
conversational recommender systems, user interface, preference elicitation, critique-based recommendations

IntRS'21: Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, September 25, 2021, Virtual Event
yuan.ma@uni-due.de (Y. Ma); timm.kleemann@uni-due.de (T. Kleemann); juergen.ziegler@uni-due.de (J. Ziegler)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

In recent years, conversational styles of interaction have increasingly been applied in the field of recommender systems. Conversational Recommender Systems (CRS) aim to provide a more human-like and more comprehensible form of eliciting users' preferences and recommending suitable items [1]. While many e-commerce sites present recommendations in a static fashion allowing little or no user interaction, the need for more flexible and personalized ways of providing recommendations is increasingly recognized. For this purpose, CRS are utilized, whereby a virtual agent interacts with the user [2]. A variety of techniques has been explored for providing conversational interaction with a recommender. Some systems, for example, follow a strict rule-based process requiring the user to answer questions with predefined answers [3], while other approaches are designed to mimic a natural conversation in which users can freely formulate their questions and answers [4]. Rather than providing one-shot recommendations with only limited user intervention, CRS enable users to respond to the recommendations they receive, to criticize them, or to provide more precise indications of their own preferences in interactive conversations with the virtual agent [4]. The mode of interaction in current CRS is predominantly text-based, where textual input by the user is followed by textual responses. In some cases, however, CRS also offer the user a set of potential answers in a GUI style, presenting the options as buttons.
Augmenting the text-only interaction with GUI elements serves two purposes: (1) it provides users with a more efficient input technique, and (2) it reduces the number of misrecognitions which may occur when relying solely on typed (or spoken) input. While combining textual and GUI interaction in a CRS can increase its effectiveness and usability, the design space of possible multimodal CRS interactions has not yet been sufficiently explored, and there are a number of options that have not been investigated or even considered in CRS research. In this paper, we aim at providing a first, more complete investigation of the different ways in which textual interaction can be combined with GUI-like interactions. Keeping the flexibility of free text interaction, we introduce and investigate a variety of additional interactions by which the textual interaction may be augmented. The options investigated include directly changing (critiquing) features of an item shown as a recommendation in the dialog as well as interacting with the textual responses given by the system. We developed video prototypes for these interactions and investigated them in an online user study. The results provide initial insights into interaction methods that users may prefer.

2. Related Work

CRS exhibit a number of potential advantages over conventional GUI-based recommenders. They can provide a more natural and less obtrusive way of obtaining information about the user's preferences, which is essential for generating personalized recommendations [5]. In contrast to upfront elicitation steps that are detached from the actual recommending, such as rating a number of sample items (e.g., MovieLens, http://www.movielens.org), completing initial interviews [6], or answering personality questionnaires [7, 8], the elicitation of needs and preferences can be smoothly integrated into the dialog flow, thus also mitigating the cold-start problem. Provided a sufficient level of language understanding on the part of the system, the expression of user intentions [9], preferences or even dislikes is more flexible in comparison to system-initiated GUI interactions and closed-form questions. This flexibility may come, however, at the cost of lower efficiency, especially when users need to type their questions and responses instead of just clicking one of several pre-defined options. Therefore, CRS should aim to achieve an acceptable flexibility-efficiency trade-off [10].

A further, and so far less considered, aspect of CRS is their capability to provide means for critiquing the system's recommendations, thus increasing user control over the recommendations. According to Chen and Pu [11], three critiquing approaches can be distinguished: natural language dialog-based critiquing (NLC), system-suggested critiquing (SC) and user-initiated critiquing (UC). NLC, as a specific form of conversation, either text-based or voice-based, is well compatible with the general CRS approach. NLC can be performed in a human-like style, simulating, for instance, the conversation with a salesperson (e.g., ExpertClerk [12]). SC, on the other hand, provides system-initiated critiquing options and asks users for one or more responses, as, for example, in multi-attribute utility theory (MAUT)-based Compound Critiques [13]. The benefits of SC lie mainly in its ability to guide users concerning relevant and acceptable feedback criteria, so that the system can better understand users and enhance its recommendation effectiveness.
UC offers users a more user-initiated form of critiquing, such as in Example Critiquing [14]. The main advantage of UC is that it allows for a higher level of user control. Hybrid critiquing techniques have also been proposed [11] and compared [15]. By means of dialogs in a CRS, both UC and SC may be combined and used flexibly. SC and UC are often realized using graphical user interface elements, such as buttons, sliders and checkboxes, to either respond to system questions (SC) or to change properties of a recommended item (UC).

There exists a large and increasing variety of techniques for realizing CRS [2]. Recent approaches for improving CRS performance include, among others, knowledge graph-based methods [16], contextual bandits [17], bandit approaches unifying items and features [18], and topic-guided methods [19]. A recent survey [1] provides a good overview of the current status of CRS. Most recent works focus on the underlying methods and algorithms, providing users with a text-only form of interaction. However, multi-turn conversations can be built on any form of interaction or on mixed-modality interactions instead of a merely textual form [20]. Only limited research has thus far focused on the question of how conversational interactions in recommender systems using different modalities impact CRS performance and, in particular, users' perception of a CRS.

Several works have investigated different interaction methods, although in other domains. For example, Ciechanowski et al. [21] evaluated different interaction styles to avoid the uncanny valley effect in chatbots; the interactions studied included text as well as voice with human-like avatar animations. In their study, text-based interaction was considered more pleasant by the participants, compared to voice interaction with a human-like avatar. The combination of digital assistants and CRS has been investigated recently; here, the results indicate that a combination of buttons and natural language is particularly beneficial [22]. Jin et al. [23] conducted an experiment to explore the correlation between users' interaction behavior and personal characteristics. What is interesting for our study is that they deployed several interaction methods in MusicBot: text, voice, buttons and radio buttons (for ratings). From their results, one can see that participants used buttons most frequently, then radio buttons, followed by text and voice. This indicates that text-only interaction in a CRS might not be the most useful and preferred technique, providing a motivation for the research presented here. Valério et al. [24] performed a comparison of different chatbot interaction paradigms. The chatbot Kino used only text to communicate with users, while the alternative chatbot Cinemito used text in combination with buttons and images for providing quick feedback. Their analysis revealed that there is no clearly preferred way of interaction. Their work was a qualitative study (n = 10) that mainly focused on users' perception, aiming to provide design guidance for chatbots. The study presented in this paper extends existing work by focusing on conversational recommenders, by introducing mixed-modality interaction in CRS, and by providing empirical evidence of the benefits of combining interaction modalities for preference capture and critiquing.
3. Mixed-Modality Interaction in CRS

Combining different modalities in human-system interaction can generally bring about various benefits such as increased efficiency of the interaction or better disambiguation in probabilistic input recognition. Text-based or speech-based dialogs provide the user with a natural and flexible interaction style which, however, is also error-prone and often not very transparent, since the user needs to anticipate the comprehension capabilities of the system to avoid misinterpretation or rejection of the input. When the input options are limited, selecting from the available options is also mostly quicker, resulting in higher efficiency and often reduced user frustration. While multimodal interfaces may employ a wide range of different modalities [25], we focus in this paper on the visual channel and on the prevalent combination of input techniques in CRS, which is textual, language-based dialog combined with graphical interaction widgets. To distinguish this type of interaction from more general multimodal interfaces, we use the term mixed modality for this combination. Even though restricted in the number of modalities, the design space for interaction based on textual conversation with integrated GUI elements is an under-researched area. To explore the design options more systematically, we propose two dimensions for characterizing the interactions.

Figure 1: Interaction styles in CRS. In (1), interaction with the virtual agent is exclusively text-based, while the other interaction methods shown allow responses via GUI: in (2), buttons are additionally provided to respond. Range sliders (3) can be used to define continuous values. In case more than one answer option may be submitted, the CRS provides checkboxes (4).

The first dimension refers to the location where the GUI element is integrated in the conversation flow. We call this the anchor dimension. An anchor can be located in or near the user input area, where typically buttons are used as shortcuts for otherwise textual user responses. Widgets can also be attached to a presented recommendation itself, be that inline in the textual flow or in a separate recommendation area. Responding to such prompts is essentially equivalent to critiquing the recommendation, since the widget allows the user to change feature values, and thus user preferences, directly on the displayed item. As a third anchor, we propose to embed interactive elements directly in the textual output of the system. This way, the user can respond directly to terms that appear in the output, such as features mentioned or the intended usage of a product. Various options exist for making such feedback available, for instance, through links or embedded drop-down lists. We assume that the user can in all cases also respond by typing a textual question or response, thus providing a flexible style of interaction. Depending on where the widget is located, it can either serve for specifying preferences in the dialog or for changing, i.e., critiquing, general or item-specific values.

The second dimension refers to the type of interactive element (widget dimension) integrated in the conversation. Here, we consider the standard widgets, which can be selected depending on the purpose and constraints of the input. Buttons, checkboxes, drop-down lists or sliders can be offered for responding to system questions, or be attached to a recommended item to show and modify its features. As a novel option in CRS, we propose to also make parts of the system output interactive by embedding links, drop-down lists, or buttons directly in the textual stream to let users react directly to questions, assumptions or suggestions made by the system. In the present study, we investigated six different interaction techniques with respect to their usefulness in CRS (Fig. 1 and 2). In the following section, we will discuss the different forms of providing feedback and critique in more detail.

Figure 2: Critique interactions. With inline critiquing (1), users have the possibility to change features directly within the question/answer of the virtual agent. Modifiable features are highlighted. By clicking on these highlighted words, a drop-down list appears from which other options can be selected. Item-based critiquing (2) allows users to critique the characteristics of features based on a recommended item. Here, the values of the features of the displayed item can likewise be changed by means of a drop-down list.
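To make these two dimensions more tangible, the following minimal sketch illustrates one possible way of representing a conversational turn that carries anchored widgets. It is purely illustrative and does not describe the implementation of the prototype used in this study; all class and field names (AgentTurn, Widget, Anchor, WidgetType) are hypothetical.

```python
# Illustrative sketch only: one possible representation of a CRS turn that combines
# free text with anchored GUI widgets along the two dimensions described above.
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Tuple


class Anchor(Enum):
    INPUT_AREA = "input_area"          # shortcut widgets near the user input field
    RECOMMENDATION = "recommendation"  # widgets attached to a recommended item
    SYSTEM_TEXT = "system_text"        # interactive keywords embedded in the output text


class WidgetType(Enum):
    BUTTON = "button"      # single choice, one click
    CHECKBOX = "checkbox"  # several answer options may be submitted
    SLIDER = "slider"      # continuous values, e.g. a price range
    DROPDOWN = "dropdown"  # list of alternative feature values


@dataclass
class Widget:
    widget_type: WidgetType
    feature: str                                  # the preference or item feature it manipulates
    options: list = field(default_factory=list)
    anchor: Anchor = Anchor.INPUT_AREA
    text_span: Optional[Tuple[int, int]] = None   # character offsets, only for SYSTEM_TEXT anchors


@dataclass
class AgentTurn:
    text: str                                     # natural-language response; free-text replies remain possible
    widgets: list = field(default_factory=list)


# Example turn: quick-reply buttons for the budget plus an inline drop-down on the usage.
text = "I found a touring bike for city trips. What is your budget?"
start = text.index("city trips")
turn = AgentTurn(
    text=text,
    widgets=[
        Widget(WidgetType.BUTTON, feature="budget",
               options=["< 500 EUR", "500-1000 EUR", "> 1000 EUR"]),
        Widget(WidgetType.DROPDOWN, feature="usage",
               options=["city trips", "off-road", "racing"],
               anchor=Anchor.SYSTEM_TEXT, text_span=(start, start + len("city trips"))),
    ],
)
```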
3.1. Critiquing in CRS

Item-based critiquing (Fig. 2 (2)) is a technique that has been gaining considerable interest in recommender systems research. Since recommendations may not meet user preferences, critiquing the features of a recommended item allows users to modify or incrementally refine their preferences in an interactive fashion, thus also increasing their control over the recommendations provided [11]. In a CRS, the critiquing approach can be extended to also provide feedback on other concepts that appear in the conversation, such as the system's assumptions about the intended usage of an item, or any other aspect of the user model that is explicitly mentioned in the system output. Integrating item-based critiquing in a CRS could help users conveniently supplement or modify their preferences once the first recommendation appears. Besides limited flexibility, a further drawback is the learning cost of the interaction: users need time to adapt to it.

3.1.1. Inline critiquing

We propose a novel inline critiquing interaction by which users can conveniently state and modify their preferences. The basic idea is that once the system presents a recommendation to the user, it also simultaneously generates a response summarizing relevant item properties as well as the user preferences collected so far. In this summary, some keywords are marked that can be directly modified by the user in the text. This avoids the problem that users would have to refer verbally to previous system responses to criticize their content in a purely text-based interface. This method can, in principle, be applied both to previous system outputs and to user inputs. We assume that this form of feedback provides advantages with respect to efficiency as well as error-avoidance.
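As an illustration of this mechanism, the following minimal sketch assumes that the generated summary is a template whose slots correspond to stored preferences; each slot is marked so that the UI layer can render it as an interactive keyword, and a selection from the attached drop-down list updates the preference model for the next turn. The markup and function names are hypothetical and are not taken from the study prototype.

```python
# Minimal, illustrative sketch of the inline critiquing idea; not the prototype's actual mechanism.
preferences = {"usage": "city trips", "frame": "aluminium", "budget": "500-1000 EUR"}

SUMMARY_TEMPLATE = ("So far you are looking for a bike for {usage} "
                    "with a {frame} frame in the range of {budget}.")

# Alternative values the CRS could offer in the drop-down attached to each keyword.
ALTERNATIVES = {
    "usage": ["city trips", "off-road", "racing"],
    "frame": ["aluminium", "carbon", "steel"],
    "budget": ["< 500 EUR", "500-1000 EUR", "> 1000 EUR"],
}


def render_summary(prefs: dict) -> str:
    """Mark each modifiable slot, e.g. as [[slot:value]], so the UI layer can render it
    as underlined text, highlighted text, or a drop-down button."""
    return SUMMARY_TEMPLATE.format(**{k: f"[[{k}:{v}]]" for k, v in prefs.items()})


def apply_inline_critique(prefs: dict, slot: str, new_value: str) -> dict:
    """Apply the value the user picked from the drop-down to the preference model."""
    if new_value not in ALTERNATIVES.get(slot, []):
        raise ValueError(f"{new_value!r} is not an offered option for {slot!r}")
    return {**prefs, slot: new_value}


print(render_summary(preferences))
preferences = apply_inline_critique(preferences, "usage", "off-road")
print(render_summary(preferences))
```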
In the following, we describe the three different styles of emphasizing interactive keywords used in our study and the widgets used for providing feedback.

Figure 3: Alternative inline critiquing styles, i.e., display styles that avoid the risk of confusion with hyperlinks. In (1), changeable feature values are emphasized in the text by an eye-catching background color. In (2), the common design pattern for drop-down lists is used to directly illustrate that further options are available.

Underlined Text. Fig. 2 (1) shows the first style of critiquing textual elements. The keyword is underlined and shown in a different color, as is common for visualizing web links in HTML. Once the user clicks on this link, a list pops up showing selectable—possibly popular—options. Other feedback can be freely entered in an editable text field. The technique is well known to most users, although some might misinterpret it as a Web link.

Highlighted Text. The second style of indicating interactive keywords uses highlighting, showing the text with a colored background (Fig. 3 (1)). Highlighting can easily attract users' attention and, unlike underlined text, it does not risk being mistaken for a hyperlink. However, users may interpret it as indicating importance rather than interactivity.

Drop-down Button. Fig. 3 (2) shows the third style of inline critiquing we investigated. This form inserts a button in the text indicating a drop-down list, which has equivalent functionality to the other two styles. Buttons are easily recognizable as interactive objects and avoid the potential misinterpretations of the other two styles, but may look awkward inside running text.

4. Evaluation

In this section, we describe an empirical comparison of the interaction patterns described in the previous section, and of their meaningful combination in a mixed-modality CRS, against a conventional CRS. We conducted a user study to determine whether there exist preferences for the different techniques and critiquing styles when engaging with conversational agents. In particular, we intended to investigate whether users prefer a purely text-based CRS (TBCRS) or a mixed-modality interaction CRS (MICRS). In addition, we were interested in identifying which interaction modes are favored for different communicative tasks.

4.1. Method

In order to investigate these questions, we performed a study using video prototypes of the techniques described. To obtain a deeper understanding of users' perception of the techniques, our study was designed to capture quantitative data as well as qualitative feedback from the participants. We split our study into three parts to investigate these research questions.

4.1.1. Comparison between TBCRS and MICRS

Since the focus of this study was on obtaining initial insights into users' preferences and the perceived usability of different interaction patterns, we did not yet implement a working CRS with the interactions integrated. Instead, the evaluation was done by means of videos showing the interactions based on a conversational recommender scenario in a fictitious bicycle shop. We created two videos exemplifying different levels of interaction with a fictitious online bicycle CRS. During the first part of the experiment, we presented these two videos to the participants, showing conversations with the CRS in the form of a chatbot. In the videos, a fictitious user tried to find a suitable bicycle for himself by means of the chatbot. Both videos were identical in their content; only the user's interaction possibilities with the chatbot varied as follows:

1. Text-based CRS (TBCRS): The conversation with the chatbot is solely text-based. Besides the text-based method, there is no alternative option to respond to the chatbot's questions (video: https://intsys.info/tbcrs).

2. Mixed-modality Interaction CRS (MICRS): The conversation with the chatbot is both text-based and via direct feedback using, e.g., buttons or drop-down lists. Each of the interaction methods described in Section 3 is exhibited. For all actions that the fictitious user has to perform in the conventional system through text input, there is an alternative interaction possibility in this version. However, at all times, the user is able to enter simple textual input instead (video: https://intsys.info/micrs).
Both videos were presented to all participants. Participants were allowed to pause, resume and restart the videos at any time. There was no time limit for watching the videos. We counterbalanced the order of the videos, resulting in a within-subject design. After each video, participants were asked to fill in a questionnaire. If not indicated otherwise, all questionnaire items had to be answered on 1-5 Likert response scales, with higher values indicating a more positive assessment. For this purpose, we asked participants to imagine themselves interacting with the chatbot shown and to evaluate the interaction possibilities. To assess user interface satisfaction, we applied the factors of "overall reaction to the software" from the QUIS questionnaire [26], consisting of six items. These items were assessed by means of a polarity profile. In addition, we constructed nine items that were specifically intended to evaluate the interaction methods shown. Furthermore, we assessed the domain knowledge of participants with self-constructed items and collected demographic data.

4.1.2. Interaction Methods in Detail

During the second part of the study, we sought to obtain more detailed feedback on the six different interaction methods described in Section 3: free text, buttons, checkboxes, sliders, item-based critiquing and inline critiquing. Here, we successively showed participants the individual interaction methods as screenshots. All interaction methods had already been shown in the videos during the first part of the study, so participants had already seen the interaction process with the respective method. We asked them to rate each interaction opportunity by means of self-constructed questions regarding enjoyability, supportiveness, efficiency and precision. In addition, we asked a specific question regarding critiquing efficiency for the free-text, inline critiquing and item-based critiquing methods. Additionally, we asked what they particularly liked or disliked about each interaction method. These optional questions were open-ended.

4.1.3. Inline critiquing styles

Finally, we aimed to identify the preferred presentation style for the inline critiquing method. Therefore, we asked participants to choose one of the three different designs described in Section 3.1.1 as the most appropriate one for directly modifying features in the text. Additionally, they were asked to briefly describe why they selected a particular style. These two questions were optional.

4.2. Participants

We recruited 70 participants using Prolific (https://www.prolific.co), a tool commonly used for academic surveys [27], of whom 63 finished the study. We pre-selected Prolific users based on the following criteria to maximize quality: (1) participants should be fluent in English; (2) their success rate should be greater than 95 %; and (3) the survey should not be conducted on smartphones or tablets, to ensure that the interaction methods shown in the videos and screenshots could be recognized easily. The average duration of the survey was 13.18 minutes (SD = 3.28), and each participant received a compensation of £1.25 upon successfully completing the survey. In our analysis, we only considered participants who watched the videos completely, leaving us with 54 participants.

Demography. Out of 54 participants, 32 were female. Their age ranged from 18 to 81 (M = 35.2, SD = 14.29).
The majority had a university degree (46.3 %), 27.8 % had a higher education entrance qualification, and 11.1 % had a general certificate of secondary education. The majority originated from the United Kingdom (85.2 %); the remaining participants originated from South Africa (3.7 %) and further countries (11.4 %). The domain knowledge of the participants was rather low (M = 2.16, SD = 1.12).

4.3. Results

We present the quantitative and qualitative results of the comparison between the two video prototypes, TBCRS and MICRS (Section 4.3.1), followed by specific quantitative and qualitative evaluations of each interaction method separately (Sections 4.3.2 and 4.3.3). Finally, we detail the user comments regarding the proposed inline critiquing styles (Section 4.3.4). Throughout, we quote exemplary statements made by participants.

4.3.1. Comparison between TBCRS and MICRS

Tab. 1 shows the overall reaction statistics of the two tested CRS, which are derived from participants' ratings of the QUIS questionnaire items and our self-constructed items. To determine whether there are differences between the two conditions, we performed paired t-tests. Unless stated otherwise, preconditions for this and subsequent calculations were met. We used an α-level of .05 for all statistical tests.

Table 1
Results from the paired t-test (df = 53) between the two conditions. Higher values indicate better results. Values marked with * are significant at a level of p < .05. The upper part of the table shows the items from the QUIS questionnaire, the lower part the self-constructed items for evaluating the two systems.

Item | TBCRS M (SD) | MICRS M (SD) | T | p | d
terrible / wonderful | 3.83 (0.99) | 3.96 (0.97) | -0.880 | .383 | -0.120
difficult / easy | 4.17 (0.97) | 4.30 (0.92) | -0.806 | .424 | -0.110
frustrating / satisfying | 3.50 (1.26) | 3.89 (1.08) | -2.183 | .033* | -0.297
inadequate power / adequate power | 3.81 (1.20) | 4.06 (1.04) | -1.390 | .170 | -0.189
dull / stimulating | 3.26 (1.31) | 3.59 (1.09) | -1.685 | .098 | -0.229
rigid / flexible | 3.63 (1.22) | 3.50 (1.26) | 0.693 | .491 | 0.094
Messages from the chatbot which prompt for user inputs are clear. | 4.11 (0.88) | 4.20 (0.90) | -0.637 | .527 | -0.087
Learning to interact with the chatbot is easy. | 4.39 (0.69) | 4.31 (0.80) | 0.704 | .485 | 0.096
I liked the methods for interacting with the chatbot. | 3.74 (1.15) | 3.85 (1.07) | -0.685 | .496 | -0.093
The chatbot gives me the opportunity to react fast and easily to its questions. | 4.06 (0.94) | 4.11 (0.88) | -0.358 | .722 | -0.049
It is easy to express which product features I want. | 4.02 (0.94) | 3.98 (0.90) | 0.265 | .792 | 0.036
With the chatbot I can always easily and efficiently articulate my requirements. | 3.93 (1.01) | 3.83 (1.04) | 0.552 | .583 | 0.075
It is easy to adjust my preferences. | 3.89 (1.08) | 4.15 (0.90) | -1.528 | .132 | -0.208
It is easy to understand why the chatbot is showing me the recommendations. | 4.02 (0.42) | 4.24 (0.80) | -1.806 | .077 | -0.246
It is easy to criticize the features of the shown recommendations. | 3.54 (1.16) | 3.35 (1.05) | 1.043 | .301 | 0.142

Except for the "rigid/flexible" item, the MICRS version shown in the video prototype received better average ratings on all elicited items from the QUIS questionnaire. For the item "frustrating/satisfying", we identified a significant difference between the two tested versions. For all other factors tested, we did not observe any significant differences between the two conditions. Furthermore, we could not identify any significant correlations between domain knowledge and the values reported here or in the remainder of this work.
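For reference, the following minimal sketch shows how the per-item comparison reported in Table 1 can be computed: a paired t-test together with Cohen's d for dependent samples, which for paired data equals t divided by the square root of the sample size. The ratings in the sketch are simulated, since the raw per-participant data are not reproduced here.

```python
# Illustrative sketch of the per-item analysis in Table 1 (paired t-test plus effect size).
import numpy as np
from scipy import stats


def paired_comparison(tbcrs: np.ndarray, micrs: np.ndarray):
    """Both arrays hold one rating per participant (n = 54 in the study) for one item."""
    t, p = stats.ttest_rel(tbcrs, micrs)
    diff = tbcrs - micrs
    d = diff.mean() / diff.std(ddof=1)  # Cohen's d for paired samples, equal to t / sqrt(n)
    return t, p, d


# Simulated 1-5 ratings; the real data are not published here.
rng = np.random.default_rng(0)
tbcrs_ratings = rng.integers(1, 6, size=54).astype(float)
micrs_ratings = np.clip(tbcrs_ratings + rng.normal(0.2, 1.0, size=54), 1, 5)
print(paired_comparison(tbcrs_ratings, micrs_ratings))
```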
Comparing the comments on each of the video prototypes, it seems that participants appreciate that it feels "natural to interact" with the virtual agent in the TBCRS version: "I liked this advisor [TBCRS] better as it made me feel like I'm talking to a real person". The virtual agent appears "more human" and "it felt more personal" than in the MICRS variant showcased. Conversely, other participants worry that it might be more difficult to express requirements because "[...] people don't always know exactly what they want and it would be difficult to articulate properties efficiently" and "with free text input, it is difficult to know what answer the chatbot is probing for, which can lead to frustration." In addition to the "potential for higher error margins for misunderstanding the customer," some participants also mentioned difficulties in "[...] reasoning for recommendations because it is less clear what information is considered."

The positive comments on the video prototype of the MICRS mainly refer to the increased efficiency ("It was very efficient and time-saving"; "I like how easy it is to fine-tune my preferences") and the possibility of specifying personal preferences more easily: "I liked the given options, which saved time and gave ideas you might not have necessarily thought of." The participants also perceived the interaction options used as "straightforward and self explanatory." The reasons why participants disliked this prototype were primarily that its options were rather "specific and seemed less flexible." In addition, some participants were not aware that, besides the suggested interaction methods, they could continue to provide open text input. A few of them stated that it "did not feel authentic."

4.3.2. Free Text vs. GUI-Responses

To compare the different interaction methods, we combined the assessed values for the GUI-based interaction methods (checkboxes, buttons and sliders) into one score. We did this because the text-only approach may be used universally, whereas not every GUI-based interaction method is equally suitable for all response types. We conducted a paired t-test to compare text-based input with the GUI-based methods. As shown in Tab. 2, the GUI-based interaction methods for responding to the virtual agent were rated consistently higher. However, we only observed a significant difference for enjoyability.

Table 2
Results from the paired t-test (df = 53) for the self-constructed items between the free text and GUI-based responses (buttons, checkboxes and sliders). Higher values indicate better results. Values marked with * are significant at a level of p < .05.

Item (description) | Free text M (SD) | GUI-Responses M (SD) | T | p | d
Enjoyability (I like this kind of interaction.) | 3.61 (1.32) | 4.22 (0.84) | -3.234 | .002* | 1.388
Supportiveness (This interaction supports me in my search.) | 3.87 (1.14) | 4.14 (0.84) | -1.702 | .095 | 1.146
Efficiency (This interaction offers me the possibility to articulate my requirements in an easy and efficient way.) | 3.91 (1.14) | 4.12 (0.86) | -1.299 | .200 | 1.222
Precision (This interaction gives me the opportunity to respond precisely to the chatbot's output.) | 4.04 (0.95) | 4.16 (0.76) | -0.857 | .395 | 1.059
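A minimal sketch of the aggregation described above (averaging the button, checkbox and slider ratings into one GUI score per participant before the paired comparison against free text) could look as follows; the column names and ratings are made up for illustration.

```python
# Illustrative sketch of combining the three GUI widget ratings into one score per participant.
import pandas as pd
from scipy import stats

# One row per participant; one column per interaction method (ratings on the 1-5 scale).
ratings = pd.DataFrame({
    "free_text": [4, 3, 5, 2, 4],
    "button":    [5, 4, 4, 4, 5],
    "checkbox":  [4, 4, 5, 3, 5],
    "slider":    [4, 5, 4, 4, 4],
})

ratings["gui_score"] = ratings[["button", "checkbox", "slider"]].mean(axis=1)
t, p = stats.ttest_rel(ratings["free_text"], ratings["gui_score"])
print(f"t = {t:.3f}, p = {p:.3f}")
```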
In addition to the quantitative data, we also analyzed the participants' comments for each of the interaction styles presented. Tab. 4 shows a meta-analysis of the comments separated by interaction style. On average, the GUI-responses received 35.6 positive comments, 13.0 neutral comments and 4.7 negative comments. All other interaction methods received fewer negative comments than the solely text-based interaction. In fact, more than half of the comments on the text-based interaction were neutral or negative (Tab. 4).

Participants commented negatively on the text-only input that "with open-ended responses there are so many ways to respond that I probably would be unsure that my answer would be interpreted correctly." Others commented positively that "it felt more like writing with a real person" and that they "preferred this kind of input because it is more descriptive." Positive comments concerning the GUI-responses often referred to the simplicity ("very clear and helpful", "straightforward") and precision of the input: "It is easy to select what you want instead of typing and potentially making a typo which could impact results." However, other participants noted that these interactions were "potentially limiting" and "narrow", and that "they may not cover all possible responses." For other participants, these forms of interaction were "[...] too similar to conventional filtering systems in online stores."

4.3.3. Free Text vs. Critiquing Methods

Additionally, we performed multiple repeated measures ANOVAs to compare the text-based interaction with the inline and item-based critiquing variants. The item-based variant was assessed consistently better than the text-only and inline critiquing variants (Tab. 3).

Table 3
Results from the repeated measures ANOVA for the self-constructed items. Higher values indicate better results. Values marked with * are significant at a level of p < .05. dfn indicates degrees of freedom of the numerator, dfd degrees of freedom of the denominator. For items marked with †, the Greenhouse-Geisser adjustment was used to correct for violations of sphericity.

Item | Free Text M (SD) | Inline crit. M (SD) | Item-based crit. M (SD) | dfn | dfd | F | p | ηp²
Enjoyability | 3.61 (1.32) | 3.98 (1.14) | 4.28 (1.00) | 2 | 106 | 6.846 | .002* | 0.114
Supportiveness | 3.87 (1.12) | 3.91 (1.15) | 4.26 (0.98) | 2 | 106 | 4.140 | .019* | 0.072
Efficiency | 3.91 (1.14) | 3.89 (1.09) | 4.28 (1.02) | 2 | 106 | 3.965 | .022* | 0.070
Precision | 4.04 (0.95) | 3.94 (1.12) | 4.24 (0.93) | 2 | 106 | 1.892 | .156 | 0.034
Critiquing Efficiency (a) † | 3.70 (1.18) | 3.85 (1.14) | 4.13 (1.12) | 1.74 | 91.99 | 2.534 | .092 | 0.046
(a) Item description: This interaction gives me the possibility to criticize the displayed features of the recommendations in an easy and efficient way.

Regarding enjoyability, we found a significant difference between the tested conditions (Tab. 3). Post-hoc analysis revealed a significant difference (p = .003) between the free-text and item-based critiquing interactions (mean difference -0.667, 95 % CI [-1.14, -0.19]). Post-hoc tests performed here and in subsequent results were Bonferroni-adjusted. In terms of supportiveness, the results indicate a significant difference between the conditions (Tab. 3). Again, post-hoc tests revealed a significant difference (p = .030) between the free-text interaction and item-based critiquing (-0.389, 95 % CI [-0.75, -0.03]). Furthermore, the efficiency of the interaction variants was rated significantly differently (Tab. 3). Compared to the free text and inline critiquing variants, item-based critiquing was rated better. However, the post-hoc tests only revealed a significant difference (p = .030) between item-based and inline critiquing (0.389, 95 % CI [0.028, 0.75]). Although item-based critiquing had higher means than inline critiquing and text-based interaction for the last two items shown in Tab. 3 (precision and critiquing efficiency), no significant differences were detected.
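The following sketch illustrates this type of analysis: a one-way repeated measures ANOVA over the three interaction variants, followed by Bonferroni-adjusted pairwise t-tests. It uses made-up data and hypothetical column names; the Greenhouse-Geisser correction reported for the critiquing-efficiency item would additionally require a sphericity check, which is omitted here.

```python
# Illustrative sketch of the Section 4.3.3 analysis: RM-ANOVA plus Bonferroni post-hoc tests.
from itertools import combinations
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Long-format data: one rating per participant and interaction method (made-up values).
df = pd.DataFrame({
    "participant": [p for p in range(1, 7) for _ in range(3)],
    "method": ["free_text", "inline", "item_based"] * 6,
    "rating": [3, 4, 5, 4, 4, 4, 2, 3, 4, 5, 4, 5, 3, 4, 5, 4, 3, 4],
})

# Omnibus repeated measures ANOVA.
print(AnovaRM(data=df, depvar="rating", subject="participant", within=["method"]).fit())

# Bonferroni-adjusted pairwise comparisons (3 paired t-tests).
pairs = list(combinations(df["method"].unique(), 2))
for a, b in pairs:
    x = df[df["method"] == a].sort_values("participant")["rating"].to_numpy()
    y = df[df["method"] == b].sort_values("participant")["rating"].to_numpy()
    t, p = stats.ttest_rel(x, y)
    print(a, "vs", b, f"t = {t:.2f}, p_bonf = {min(p * len(pairs), 1.0):.3f}")
```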
Table 4
Summary of the qualitative feedback for the various interaction methods. The percentage rows refer to the four groups free text, GUI-responses (button, checkbox and slider combined), inline critiquing and item-based critiquing.

Comments | Free text | Button | Checkbox | Slider | Inline critiquing | Item-based critiquing
# Neutral | 17 | 15 | 12 | 12 | 18 | 12
# Positive | 21 | 35 | 36 | 36 | 31 | 35
# Negative | 16 | 4 | 5 | 5 | 4 | 7

Share | Free text | GUI-Responses | Inline critiquing | Item-based critiquing
Positive | 38.9 % | 66.8 % | 58.5 % | 64.8 %
Negative | 29.6 % | 8.8 % | 7.6 % | 13.0 %

The comments received regarding the different critiquing methods were also rather positive. Item-based critiquing received more positive comments than inline critiquing. Overall, inline critiquing received the most neutral comments (Tab. 4). Regarding the item-based critiquing option, participants liked the "[...] ability to directly select different options" and to be able "[...] to criticize options directly based on the given items." Others appreciated that "previous specifications were already taken into account" as well as "[...] being able to specify further features." However, others were critical and noted that "less technological affine users could possibly be overwhelmed." Some participants felt that inline critiquing was "not as seamless as the other GUI options". In addition, some participants were critical that this interaction option might not be understood by everyone: "If people are not familiar with the Internet (e.g. the older generation) they may not understand how to use this." Others experienced a similar problem in terms of the representation: "I would not have realized they were drop-down lists and assumed they were links [...]."

4.3.4. Inline Critiquing Styles

Next, we present the assessment results for the proposed inline critiquing styles. Of the 54 participants, 50 specified a favored style for inline critiquing. From this group, the majority (60.0 %) preferred the drop-down button (Fig. 3 (2)). Less often, the other two variants, highlighted text (24.0 %; Fig. 3 (1)) and underlined text (16.0 %; Fig. 2 (1)), were chosen. The other 4 participants did not indicate a preference, but still provided comments on the styles shown. A χ² goodness-of-fit test shows that there are significant differences between the observed frequencies (χ²(2, N = 50) = 16.485, p < .001). Post-hoc analysis revealed significant differences between the drop-down button and underlined text styles (p = .003), and between the drop-down button and highlighted text (p = .024). However, we could not detect any significant difference (p > .999) between the underlined text and highlighted text styles.
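The reported χ² value can be reproduced from the stated shares: with 50 respondents, 60 %, 24 % and 16 % correspond to observed counts of 30, 12 and 8. The following minimal sketch shows the goodness-of-fit test against a uniform distribution; the pairwise tests included below are only one possible post-hoc procedure and not necessarily the one used for the p-values reported above.

```python
# Illustrative sketch of the frequency analysis of the inline critiquing style choices.
from itertools import combinations
from scipy import stats

observed = {"drop-down button": 30, "highlighted text": 12, "underlined text": 8}

chi2, p = stats.chisquare(list(observed.values()))  # expected counts default to a uniform distribution
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")            # reproduces chi2(2, N = 50) = 16.485

# One possible Bonferroni-adjusted post-hoc procedure: pairwise goodness-of-fit tests.
pairs = list(combinations(observed, 2))
for a, b in pairs:
    chi2_ab, p_ab = stats.chisquare([observed[a], observed[b]])
    print(a, "vs", b, f"p_bonf = {min(p_ab * len(pairs), 1.0):.3f}")
```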
Comments from those favoring the drop-down button stated that it is "[...] obvious that one can click and alter something [...]" in this variation. Others noted that it is "[...] particularly obvious that there are additional options" and that "[...] it cannot be confused with a hyperlink." Other participants stated that "it appears as an option whereas the other styles might be missed." Participants who favored the highlighted text style argued that it "[...] stands out most" and that it is "easiest" and "clearest to see without no thought, because of the words highlighted with the background color." Those participants who favored the variant with the underlined text justified their decision by stating that "[...] it looks like a hyperlink and thus makes it clearer that one can click on it." Others noted that this inline critiquing style "is simple and less cluttered." Participants who did not nominate a favorite justified this by stating that all three styles shown are "useful when there are a lot of options to choose from" or that all variations "are clearly illustrating options."

4.4. Discussion

In this section, we discuss the results of our empirical study. We first elaborate on the comparison between the two conditions and subsequently discuss the findings regarding the various interaction methods.

4.4.1. Comparison between TBCRS and MICRS

First, the comparison of the two video prototypes showed slightly better scores for the MICRS condition on most factors; however, using paired t-tests, we were only able to observe a significant difference for one factor (frustrating/satisfying). These results are in line with the participants' comments. Here, they rated the MICRS as less error-prone. They also noted that it is simpler to recognize possible options and to specify their preferences. Similarly, the ability to respond quickly to questions was also rated slightly better. As expected, entering feedback through buttons or drop-down lists is faster and easier than formulating text, provided that adequate response options are available and that the displayed interaction element fits the question at hand. That the MICRS condition is perceived as more stimulating than the comparison condition may be explained by the observation that participants discover aspects and potential options that they had not considered before. Conversely, the provided options may also lead to the MICRS condition being seen as more rigid. Some participants noted that it would be nice to still be able to submit open text input if, for instance, none of the available options apply. Although this was possible in the prototype video shown, the option may not have been obvious enough. In a real system, a text-based input option should therefore remain available to avoid restricting users unnecessarily. This, in turn, can also ensure a more flexible conversation and exploit the potential strengths of an open-ended CRS.

While participants rated the text-only condition in the questionnaires as easier to learn, this partially contradicts the comments provided. Although it may be obvious how to interact with a purely text-based system, it may still be necessary to learn how the virtual agent interprets the input to avoid misunderstandings. The ratings regarding comprehensibility and interpretability ("Messages from the chatbot which prompt for user inputs are clear"; "It is easy to understand why the chatbot is showing me the recommendations") are consistent with the provided comments. Here, the MICRS condition is rated better. Due to the highlighted features within the text, along with the explicit inputs via, e.g., buttons or drop-down lists, it is clearly comprehensible which information the system uses for providing recommendations. Compared to the text-only approach, users receive more visual information throughout the entire conversation. We assume, however, that the tendencies observed in our video-prototype-based analysis could become manifest in an interactive prototype with which participants can interact and thus use the features themselves.

4.4.2. Free Text vs. GUI-Responses

While comparing the two systems, we asked the participants to rate the individual forms of interaction shown.
Unlike in text-based interaction, an appropriate GUI option must be presented by the system depending on the logical type of response requested. Since the system demonstrated in the video prototype utilized a set of different GUI elements, we aggregated the GUI response options (buttons, checkboxes and sliders) and compared them to the text-based interaction. In the video prototype, we always provided the appropriate GUI response methods.

In terms of enjoyability, the GUI-based interaction methods were rated significantly higher, which is in line with the results of the general evaluation. One explanation might be that the use of various input methods is more interesting and thus the interaction is more enjoyable. Concerning the factor supportiveness, no significant differences were found, although there was a tendency for GUI responses to be rated better. The comments discussed in the previous section support these findings: non-textual interaction methods seem to support users better, assuming that appropriate options are available.

For the last two factors tested, there were also positive tendencies with regard to the GUI responses. These were rated slightly better than free-text input in terms of efficiency and precision, although the differences are not significant. This also corresponds to the participants' comments. As long as the appropriate answer choice can be provided directly, a GUI interaction is considered more efficient, since only one click is needed. Although it is reasonable to assume that appropriately displayed GUI elements would provide more precise feedback, the results were comparable. We suspect this is because free text input allows requirements to be expressed that are not presented as options in the GUI responses. In case users already have a clear idea of the desired item and of the requirements they intend to communicate to the system, they may not require guidance in the form of GUI response options, but can respond more precisely and flexibly with free, textual interaction.

4.4.3. Free Text vs. Critiquing Methods

When comparing the text-based input with the two critiquing methods, the results indicated that the item-based method was rated better than the purely text-based input method on all tested factors. With regard to the factors enjoyability, search support and ease of articulating requirements, significant differences were found between the tested conditions. We assume that it is more enjoyable for users to give feedback directly based on specific items than to articulate it in text. Perhaps, when critiquing features of a particular item, the implications of that critique are less ambiguous. Although inline critiquing was rated better than the text-only method in some aspects, the differences were rather minor. We suspect that the participants were not entirely aware of how this interaction method was supposed to work. This may be due to the chosen visualization, but also to the fact that this novel interaction method was not sufficiently explained within the video prototypes. While we did not find any significant differences in the other factors tested, we suspect that the alternatives to text-only interaction may still have advantages; the video prototype method we chose was possibly not capable of identifying them clearly. Here, however, we must take into account that text-only feedback may provide more accurate responses if the options provided by the CRS are not what the user expects.
4.4.4. Inline Critiquing Styles

Finally, we discuss the results for the different inline critiquing styles. Although one might assume that all three styles perform similarly, since they all have a prompting character, the participants clearly preferred the drop-down button. We assume that this is mainly due to familiarity with this technique and its clear affordance for changing values. In contrast, emphasizing active parts of the text by underlining may be confused with a hyperlink, causing users to assume that clicking on it will forward them to another page. This was also reflected in the participants' comments. Highlighting text with a colored background may not convey clearly enough that it is possible to interact with it and modify options. Rather, users might interpret this highlighting as an indication of importance or as a reference to a help text that appears when hovering the mouse pointer over the highlighted word.

5. Conclusions and Future Work

We investigated a mixed-modality interaction approach for CRS and could show in a user study that it is evaluated positively by participants, who appreciated the benefits of using diverse interaction techniques within a CRS. The possibility to criticize individual features of the recommended items directly in a CRS, as well as the proposed inline text critiquing method, was also evaluated positively. Additionally, the non-textual interaction methods were evaluated particularly favorably. Overall, this study suggests that text-only interaction might not be optimal for creating a positive user experience in a CRS. Instead, a combination of different interaction methods is probably preferable. By summarizing and emphasizing relevant item features and user preferences in the text, the explanatory value of CRS responses is probably enhanced, increasing the transparency of the system. Enabling users to modify terms directly in the output may also increase the sense of user control. Considering that CRS should be accessible from the very first use without detailed instructions, we believe the approach is promising and aim to focus in future work on an easy-to-understand embedding and on the traceability of changes resulting from applying the inline text and feature critiquing mechanisms.

As a limitation of this work, we are aware that evaluating video prototypes cannot substitute for interacting with realistic interactive prototypes. Therefore, we intend to investigate the use of mixed-modality interaction in CRS by implementing a fully interactive prototype in future work. A particular challenge for building mixed-modality CRS is the question of how the interactive options offered to the user can be derived automatically. Potential approaches might be based on leveraging knowledge graph data or information extracted from item descriptions or reviews. Also, suitable response generation techniques are needed that summarize the features the user is likely to criticize in the next interaction step. Furthermore, it will be interesting to explore techniques that can be applied to automatically decide which interaction method is most suitable in a certain conversational context.

References

[1] D. Jannach, A. Manzoor, W. Cai, L. Chen, A survey on conversational recommender systems, ACM Computing Surveys 54 (2021). doi:10.1145/3453154.
[2] Y. Sun, Y. Zhang, Conversational Recommender System, Association for Computing Machinery, New York, NY, USA, 2018, p. 235–244. doi:10.1145/3209978.3210002.
[3] J. Weizenbaum, Eliza—a computer program for the study of natural language communication between man and machine, Commun. ACM 9 (1966) 36–45. doi:10.1145/365153.365168.
[4] K. Ramesh, S. Ravishankaran, A. Joshi, K. Chandrasekaran, A survey of design techniques for conversational agents, 2017, pp. 336–350. doi:10.1007/978-981-10-6544-6_31.
[5] B. Lika, K. Kolomvatsos, S. Hadjiefthymiades, Facing the cold start problem in recommender systems, Expert Syst. Appl. 41 (2014) 2065–2073. doi:10.1016/j.eswa.2013.09.005.
[6] K. Zhou, S.-H. Yang, H. Zha, Functional matrix factorizations for cold-start recommendation, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, Association for Computing Machinery, New York, NY, USA, 2011, p. 315–324. doi:10.1145/2009916.2009961.
[7] R. Hu, P. Pu, A comparative user study on rating vs. personality quiz based preference elicitation methods, in: Proceedings of the 14th International Conference on Intelligent User Interfaces, IUI '09, Association for Computing Machinery, New York, NY, USA, 2009, p. 367–372. doi:10.1145/1502650.1502702.
[8] W. Wu, L. Chen, Y. Zhao, Personalizing recommendation diversity based on user personality, User Modeling and User-Adapted Interaction 28 (2018) 237–276. doi:10.1007/s11257-018-9205-x.
[9] W. Cai, L. Chen, Predicting user intents and satisfaction with dialogue-based conversational recommendations, in: Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, UMAP '20, Association for Computing Machinery, New York, NY, USA, 2020, p. 33–42. doi:10.1145/3340631.3394856.
[10] M. F. McTear, S. Allen, L. Clatworthy, N. Ellison, C. Lavelle, H. McCaffery, Integrating flexibility into a structured dialogue model: Some design considerations, in: 6th International Conference on Spoken Language Processing, 2000.
[11] L. Chen, P. Pu, Critiquing-based recommenders: Survey and emerging trends, User Modeling and User-Adapted Interaction 22 (2012) 125–150. doi:10.1007/s11257-011-9108-6.
[12] H. Shimazu, ExpertClerk: Navigating shoppers' buying process with the combination of asking and proposing, in: Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001, p. 1443–1448.
[13] J. Zhang, P. Pu, A comparative study of compound critique generation in conversational recommender systems, in: V. P. Wade, H. Ashman, B. Smyth (Eds.), Adaptive Hypermedia and Adaptive Web-Based Systems, Springer Berlin Heidelberg, 2006, pp. 234–243.
[14] L. Chen, P. Pu, Evaluating critiquing-based recommender agents, in: Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, AAAI'06, AAAI Press, 2006, p. 157–162.
[15] W. Cai, Y. Jin, L. Chen, Critiquing for music exploration in conversational recommender systems, in: 26th International Conference on Intelligent User Interfaces, IUI '21, Association for Computing Machinery, New York, NY, USA, 2021, p. 480–490. doi:10.1145/3397481.3450657.
[16] K. Zhou, W. X. Zhao, S. Bian, Y. Zhou, J.-R. Wen, J. Yu, Improving Conversational Recommender Systems via Knowledge Graph Based Semantic Fusion, Association for Computing Machinery, New York, NY, USA, 2020, p. 1006–1014. doi:10.1145/3394486.3403143.
[17] X. Zhang, H. Xie, H. Li, J. C. S. Lui, Toward building conversational recommender systems: A contextual bandit approach, CoRR abs/1906.01219 (2019). arXiv:1906.01219.
[18] S. Li, W. Lei, Q. Wu, X. He, P. Jiang, T.-S. Chua, Seamlessly unifying attributes and items: Conversational recommendation for cold-start users, ACM Trans. Inf. Syst. 39 (2021). doi:10.1145/3446427.
[19] K. Zhou, Y. Zhou, W. X. Zhao, X. Wang, J.-R. Wen, Towards topic-guided conversational recommender system, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 4128–4139. doi:10.18653/v1/2020.coling-main.365.
[20] C. Gao, W. Lei, X. He, M. de Rijke, T. Chua, Advances and challenges in conversational recommender systems: A survey, CoRR abs/2101.09459 (2021). arXiv:2101.09459.
[21] L. Ciechanowski, A. Przegalinska, M. Magnuski, P. Gloor, In the shades of the uncanny valley: An experimental study of human–chatbot interaction, Future Generation Computer Systems 92 (2019) 539–548. doi:10.1016/j.future.2018.01.055.
[22] A. Iovine, F. Narducci, G. Semeraro, Conversational recommender systems and natural language: A study through the conveRSE framework, Decision Support Systems 131 (2020) 113250. doi:10.1016/j.dss.2020.113250.
[23] Y. Jin, W. Cai, L. Chen, N. N. Htun, K. Verbert, MusicBot: Evaluating critiquing-based music recommenders with conversational interaction, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, Association for Computing Machinery, New York, NY, USA, 2019, p. 951–960. doi:10.1145/3357384.3357923.
[24] F. A. M. Valério, T. G. Guimarães, R. O. Prates, H. Candello, Comparing users' perception of different chatbot interaction paradigms: A case study, in: Proceedings of the 19th Brazilian Symposium on Human Factors in Computing Systems, IHC '20, Association for Computing Machinery, New York, NY, USA, 2020. doi:10.1145/3424953.3426501.
[25] A. Jaimes, N. Sebe, Multimodal human–computer interaction: A survey, Computer Vision and Image Understanding 108 (2007) 116–134. doi:10.1016/j.cviu.2006.10.019. Special issue on Vision for Human-Computer Interaction.
[26] J. P. Chin, V. A. Diehl, K. L. Norman, Development of an instrument measuring user satisfaction of the human-computer interface, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '88, Association for Computing Machinery, New York, NY, USA, 1988, p. 213–218. doi:10.1145/57167.57203.
[27] E. Peer, L. Brandimarte, S. Samat, A. Acquisti, Beyond the Turk: Alternative platforms for crowdsourcing behavioral research, Journal of Experimental Social Psychology 70 (2017) 153–163. doi:10.1016/j.jesp.2017.01.006.