<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Strategic Conversations: LLMs Argumentation and User Perception in Movie Recommendation Dialogues</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valeria Mauro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martina Di Bratto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valentina Russo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Azzurra Mancini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Grazioso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Logogramma S.r.l.</institution>
          ,
          <addr-line>Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Catania</institution>
          ,
          <addr-line>Catania</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This study investigates the persuasive and argumentative behaviors of two LLM-based chatbots, ChatGPT and Gemini, within the context of movie recommendation dialogues. Drawing on insights from argumentation-based dialogue and anthropomorphism research, we introduce a fine-grained annotation scheme to analyze chatbot strategies across dialogue phases. Through both linguistic analysis and user evaluation via ResQue and Godspeed questionnaires, we assess the systems' recommendation quality, perceived human-likeness, and strategic variation. Our findings reveal distinct conversational patterns: ChatGPT emphasizes affective engagement and trust-building, while Gemini adopts a more direct and efficiency-driven approach. These strategic differences are also reflected in recommendation quality and user perception. Gemini excels in recommendation quality and explanations, while ChatGPT performs better in emotional engagement, transparency, and user satisfaction.</p>
      </abstract>
      <kwd-group>
        <kwd>Argumentation-based dialogue</kwd>
        <kwd>Conversational Recommender Systems</kwd>
        <kwd>Anthropomorphism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        … evaluation protocol. Section 5 reports the results of our conversational pattern analysis and the questionnaire-based user study. In Section 6, we discuss our findings and possible future works.
      </p>
      <p>
        CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy [1]. * Corresponding author. † These authors contributed equally. valeria.mauro@phd.unict.it (V. Mauro); mdibratto@logogramma.com (M. Di Bratto); vrusso@logogramma.com (V. Russo); amancini@logogramma.com (A. Mancini); mgrazioso@logogramma.com (M. Grazioso). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
      </p>
      <sec id="sec-1a">
        <title>2. Large Language Models and Anthropomorphism</title>
        <p>Chatbots like ChatGPT and Gemini are powered by Large Language Models (LLMs), which are trained on vast amounts of textual data to learn the recurrent structures and patterns of human language [5, 6]. Mediated by language but implying something beyond it, the social capabilities of LLM-based systems enable them to simulate a range of human behaviors, thereby reinforcing users’ perceptions of them as human-like. Anthropomorphism, indeed, refers to the tendency to attribute human characteristics, behaviors, motivations, intentions, or emotions to non-human entities. It is a cognitive process that often leads people to perceive such systems as more human-like than they actually are. This tendency arises for several reasons. On a broad level, anthropomorphism is a natural and often automatic human response, driven by subtle cues in the system’s interface. It functions as a kind of cognitive shortcut: when users lack complete information about a non-human agent, they instinctively project human-like qualities onto it, drawing from readily accessible anthropocentric knowledge, i.e., knowledge about themselves or about humans in general [7]. The medium of interaction itself (a dialogue system) makes a degree of anthropomorphism almost inevitable. Language-based interaction, turn-taking, and the adoption of roles typically played by humans are all fundamental triggers for anthropomorphic attributions. These are further reinforced when chatbots are given human-like personas, names, or presumed preferences [8]. Certain linguistic strategies amplify this effect. For instance, during recommendation dialogues systems often use expressions that suggest uniquely human experiences (such as claiming to have “watched” a movie) or employ first-person pronouns (“I”, “me”, “my” when expressing opinions about the previously mentioned item), which reinforces the illusion of human agency and subjectivity. LLMs can also engage in interactive explanations, respond to user feedback, and even emulate emotional responses and social cues [9]. These abilities are particularly significant in recommendation scenarios, where personalization is key to user satisfaction. Systems like ChatGPT and Gemini can tailor their responses to individual profiles, adapting to user preferences and communicative styles over time [<xref ref-type="bibr" rid="ref5">10</xref>]. They can offer context-sensitive recommendations and justifications, which are especially valuable when users are unfamiliar with the items being suggested [11].</p>
        <p>Recent research highlights that these chatbots are not only capable of dynamically adapting their suggestions based on user behavior [<xref ref-type="bibr" rid="ref1">12</xref>], but also of providing clear and meaningful rationales for their decisions. This contributes to perceived transparency, an important factor in fostering trust and understanding in human-AI interaction [<xref ref-type="bibr" rid="ref6">13</xref>]. Moreover, LLMs demonstrate the ability to monitor and reflect on user satisfaction, recognize behavioral patterns across interactions, and adjust their recommendations accordingly [9]. This continuous adaptation and reflective capacity make LLM-based chatbots increasingly effective as customized, socially aware recommenders, simultaneously blurring the line between tool and social agent in the eyes of the user.</p>
      </sec>
      <sec id="sec-1b">
        <title>3. Argumentation-Based Recommender Dialogue Systems</title>
        <p>Conversational Recommender Systems (CoRS) have attracted considerable interest in recent years and are now a common feature of our everyday interactions with technology. They are built to enable smooth communication between people and machines, helping users perform tasks such as finding information and getting recommendations. A key aspect of dialogue systems in general is the use of argumentation, which plays an important role in their functionality [14].</p>
        <p>Argumentation-based dialogue (ABD) deals with phenomena depending on the dynamic exchange of information, which can vary according to turns and participants. ABD studies often build on Walton and Krabbe’s dialogue classification framework [15], which considers participants’ knowledge, their goals, and the rules guiding the conversation [16]. They define six dialogue categories, including Information Seeking, Persuasion, Deliberation, Negotiation, and Eristic. Identifying the dialogue type is especially helpful in analyzing effective dialogue moves to achieve communication goals, particularly in human-machine interactions. We chose the recommendation task since it is well-suited for evaluating the argumentation process in a human-machine interaction, thanks to its inherently dialogical nature and clear objective. It typically follows a two-phase structure, Exploration and Exploitation (E&amp;E). In the exploration phase, the system seeks new information, while in the exploitation phase, it leverages the most promising known option [17].</p>
        <p>The Exploration phase can be associated with Walton’s Information Seeking dialogue, or more specifically, the Information Sharing type, as in real dialogues the situation of lacking knowledge is often dynamic rather than static [18, 19]. The Exploitation phase, on the other hand, aligns with the deliberation dialogue, a cooperative form of interaction in which participants work together to find a solution to a shared problem while considering everyone’s interests [20]. In this context, argumentation plays a key role in proposing solutions, supporting them with reasons, and evaluating alternatives [21], all essential features for CoRS. This is especially relevant today with the advent of LLMs: integrating computational argumentation formalisms could help address challenges such as the lack of explainability, transparency, and governability [22, 23], thus maintaining a trustworthy perception among users. The aim of this work is to investigate the behavior of LLM-based chatbots in recommendation scenarios, evaluating differences and similarities in their argumentation strategies, and assessing, through human evaluation, the quality of the recommendations and the perceived anthropomorphism, as well as whether these aspects correlate with the identified argumentation strategies.</p>
      </sec>
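<p>To make the two-phase E&amp;E structure described above concrete, the following minimal sketch shows an exploration loop (preference elicitation) followed by an exploitation step (a recommendation claim backed by reasons); every function name, slot, and rule here is our own illustrative invention, not an actual system implementation:</p>

```python
# Illustrative sketch of the Exploration and Exploitation (E-and-E) structure:
# first elicit preferences (exploration), then back a recommendation claim
# with supporting reasons (exploitation). All names here are hypothetical.

def recommend_dialogue(ask_user):
    prefs = {}
    # Exploration phase: the system seeks new information via preference
    # elicitation moves (e.g., Genre Inquiry, Plot Inquiry).
    for slot, question in [
        ("genre", "Which genres do you prefer?"),
        ("plot", "Any preferred themes or tone?"),
    ]:
        prefs[slot] = ask_user(question)
    # Exploitation phase: the system leverages the most promising known
    # option, pairing the final claim with supporting reasons.
    claim = "I recommend a {} movie.".format(prefs["genre"])
    reasons = ["It matches your interest in {}.".format(v) for v in prefs.values()]
    return claim, reasons

# Scripted "user" answers for demonstration.
answers = iter(["sci-fi action", "post-apocalyptic settings"])
claim, reasons = recommend_dialogue(lambda question: next(answers))
print(claim)  # I recommend a sci-fi action movie.
```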
      <sec id="sec-1-1">
        <title>4. Data collection &amp; methodology</title>
        <p>In this study, we decided to evaluate two LLM-based chatbots in the movie recommendation domain: Gemini and ChatGPT. More specifically, our objective was to evaluate the systems’ performance as recommenders and, more broadly, as human-passing interlocutors through user ratings. Participants assessed both the quality of the recommendations and their perceptions of anthropomorphism, likeability, and intelligence. A between-subjects design was chosen to avoid carryover effects and to reduce the cognitive load and fatigue associated with completing the same questionnaire twice, which is common in within-subjects designs. Participants were mainly recruited from the BA and MA programs of the Department of Humanities at the University of Catania. The most represented age group is that of participants under 30, accounting for 87.8% of those who took part in the ChatGPT test and 92.5% of those in the Gemini test. The survey was administered via Google Forms, and data collection took place over approximately one month, from early February to mid-March 2025. A total of 95 participants took part in the study, resulting in 81 conversations correctly submitted via the designated input box, comprising 2,362 dialogue turns overall (dataset: https://github.com/marcograzioso/human-bot-recommendationdialogues-it). The study procedure followed these steps: participants read a brief introductory statement outlining the task (i.e., prompting a film recommendation from ChatGPT or Gemini in a casual, conversational style). They were also informed that additional instructions would follow and that they would be asked to submit an anonymous link to their conversation. In order to proceed, participants were required to check two consent boxes on the same page. Participants were then presented with a detailed set of instructions on how to use ChatGPT or Gemini and how to share their conversations. Users were free to interact with the bots without any conversational constraints. After completing the task using the assigned system, they submitted the link to their chat in the designated field. A demographic survey followed, collecting information on gender, age, education level, and prior experience with the chatbot. Finally, participants completed the adapted ResQue [24] and Godspeed questionnaires [25, 26]: the former to evaluate the quality of the recommendation, the latter for perceived anthropomorphism.</p>
        <sec id="sec-1-1a">
          <title>4.1. Dialogue annotation scheme</title>
          <p>The annotation scheme builds on the existing literature while introducing novel extensions. The units of analysis are dialogical moves, clusters of words or dialogue segments expressing a communicative intention [18, 27]. A move typically corresponds to a single dialogical turn, though a turn may employ multiple strategies to pursue subgoals. We deployed a set of categories for the recommender’s and the seeker’s utterances; the annotation scheme encompasses eighteen and nineteen categories, respectively. The category annotation scheme is twofold. To account for the recommender’s strategies (i.e., the chatbot’s), twelve strategies were initially selected from Hayati et al. [28], who defined this tagset in the context of human-human interaction. The first eight are sociable strategies aimed at building rapport with the seeker: Personal Opinion (PO), used by the recommender to share subjective views about a movie, such as opinions on the plot, actors, or other elements; Personal Experience (PE), used by the recommender to share personal experiences related to a movie (e.g., mentioning they’ve watched it several times) in order to persuade the seeker; Similarity (S), used to express empathy and alignment with the seeker’s preferences, creating a sense of like-mindedness and building trust; Encouragement (EG), used to praise the seeker’s taste and encourage them to watch the recommended movie; Offering Help (OH), used by the recommender to explicitly express an intention to help the seeker or to be transparent about their recommendations; Preference Confirmation (PC), used by the recommender to ask about or rephrase the seeker’s preferences, making their reasoning process explicit; Credibility (C), used by the recommender to display expertise or trustworthiness by providing factual information about the movie (e.g., plot, cast, or awards); and Self-Modeling (SM), used by the recommender to present themselves as a role model, for example by watching the movie first to encourage the seeker to do the same. Two additional categories cover preference elicitation: Experience Inquiry (EI), used by the recommender to ask about the seeker’s past movie-watching experiences, such as whether they have seen a specific movie; and Opinion Inquiry (OI), used to ask for the seeker’s opinion on specific movie-related attributes, such as their thoughts on the plot or the actors’ performances. Two functional labels are also included: Recommendation (R) and No Strategy (NS). The former (R) is intended as the final claim in the argumentation process, specifically a communicative act aimed at justifying a target claim [29]. The latter (NS) is used for phatic or neutral moves, such as greetings or backchanneling. Given the versatility of modern conversational AI systems like ChatGPT and Gemini, fully capable of posing technical questions across domains, we introduced six further categories to capture a broader range of preference elicitation strategies:</p>
          <p>• Streaming Service Inquiry (SSI): the recommender asks about the seeker’s (i.e. the user’s) preferred streaming platforms;
• Genre Inquiry (GI): the recommender asks about the seeker’s preferred genres;
• Actor Inquiry (AcI): the recommender asks about favorite actors;
• Director Inquiry (DI): the recommender asks about favorite directors;
• Plot Inquiry (PI): the recommender asks about preferred narrative or thematic features;
• Action Inquiry (AI): the recommender prompts the user regarding the next step in the conversation.</p>
          <p>The last two categories require further clarification. Since a movie inevitably involves a wide array of features that cannot be fully captured by any single fine-grained strategy, Plot Inquiry (PI) was defined broadly. It includes questions not only about narrative content but also about a film’s perceived tone (e.g., “pure fun” vs. “deep”), cultural status (e.g., “cult classic”), or recency. Action Inquiry (AI), instead, accounts for the fact that even domain-restricted dialogues can drift in topic. This label is assigned when the chatbot explicitly asks about the user’s intended course of action (for instance, “What would you like to do now?”), a strong signal that the system is adapting to dynamic user needs, which may evolve during the conversation. All the sociable strategies used to establish the conversation are reported in Table 1.</p>
          <p>Table 1: Recommender strategy tagset. Sociable strategies: Personal Opinion, Personal Experience, Similarity, Encouragement, Offering Help, Preference Confirmation, Credibility, Self-Modeling. Preference elicitation: Experience Inquiry, Opinion Inquiry, Streaming Service Inquiry, Genre Inquiry, Actor Inquiry, Director Inquiry, Plot Inquiry, Action Inquiry. Functional labels: Recommendation, No Strategy.</p>
          <p>To annotate the seeker’s strategies, eleven strategies grouped into four categories were initially adopted from Di Bratto et al. [30]. However, the scope of this work is centered on analyzing the behavior of LLM-based chatbots in engaging conversations using argumentative strategies. Therefore, the analysis of seeker utterances has not been addressed. Table 2 reports a sample of annotated dialogues. Each row includes the dialogue ID (i.e., the number of the conversation), the turn number (counted from the beginning of the dialogue), the author (either the user or the chatbot), the dialogic move under analysis, and its corresponding label. A single turn may contain multiple dialogic moves, each annotated separately.</p>
          <p>Table 2: Sample of an annotated dialogue (author, text, category):
user: “Quale film mi consiglieresti?” (‘Which film would you recommend me?’), [null];
chatgpt: “Dipende da cosa ti piace!” (‘That depends on what you like!’), OI;
chatgpt: “Preferisci azione, thriller, fantascienza, horror, dramma o commedia?” (‘Do you prefer action, thriller, science fiction, horror, drama, or comedy?’), GI;
chatgpt: “Oppure cerchi qualcosa di più specifico, tipo un film recente o un cult imperdibile?” (‘Or are you looking for something more specific, like a recent film or an unmissable cult classic?’), PI;
user: “Fantascienza azione” (‘Science fiction, action’), [null];
chatgpt: “Ottima scelta!” (‘Great choice!’), EG;
chatgpt: “Ecco alcuni film di fantascienza d’azione che potresti apprezzare: "Mad Max: Fury Road" (2015) – Un’esplosione di adrenalina in un mondo post-apocalittico con inseguimenti folli e azione non-stop.” (‘Here are some action science-fiction films you might enjoy: "Mad Max: Fury Road" (2015) – an explosion of adrenaline in a post-apocalyptic world with wild chases and non-stop action.’), R, C.</p>
          <p>To evaluate annotation quality, a second annotator with linguistic background independently annotated 15% of the total dialogue moves in the dataset. Inter-annotator agreement was then calculated using Cohen’s Kappa, resulting in a score of 0.826, which indicates a high level of agreement between the two annotators.</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>4.2. User evaluation questionnaires</title>
        <p>The evaluation constructs were adapted and translated into Italian from two well-established models: the ResQue questionnaire [24] and the Godspeed questionnaires [25]. Together, these provide a robust, user-centered evaluation framework. The final questionnaire consisted of 22 items corresponding to 16 constructs. All items were rated on a 5-point Likert scale. ResQue offers a concise yet powerful tool for assessing users’ perceptions, beliefs, attitudes, and acceptance of a recommender system. Due to the study’s scope and time constraints, we adopted the “short version” of ResQue, using one item per construct. Two constructs (Recommendation Diversity and Interaction Adequacy) were excluded. The final ResQue-based questionnaire included 13 constructs and items. All original labels were preserved, except for Purchase Intention, which was renamed Behavioral Intention to better reflect the study’s focus (Table 3).</p>
        <p>From the Godspeed model, we selected three of the five original questionnaires: Anthropomorphism (Godspeed I), Likeability (Godspeed III), and Perceived Intelligence (Godspeed IV) (Table 4). Two items were removed from each construct to streamline the questionnaire. Minimum coverage of the constructs’ theoretical domains is guaranteed, as the items from each questionnaire are interrelated. To ensure clarity and consistency, all Godspeed semantic differential scales were adapted to Likert-type items. This choice is supported by [31], who argue that Likert scales may improve response accuracy. Moreover, given that ChatGPT and Gemini are disembodied agents, we either omitted or carefully rephrased terms that refer to physical appearance in order to avoid ambiguity in the Italian target language. For instance, the expression “human-like”, typically rendered in existing Italian translations as “dall’aspetto umano” (‘with a human appearance’) [26], was considered potentially misleading when applied to text-based agents. Instead, we adapted the wording to better fit the nature of the evaluated systems and, for the same reason, chose to exclude Animacy (Godspeed II) and Perceived Safety (Godspeed V) from our evaluation. For future analysis, it would be useful to adopt Item Response Theory (IRT)-based models [32]. These models offer a principled way to address individual variability in Likert scale use by modeling latent traits while accounting for person- and item-specific influences. Moreover, advanced IRT extensions such as multidimensional and mixture models provide additional flexibility to handle systematic response biases. We believe this methodological choice would strengthen the validity and fairness of our analysis and reduce bias due to differential scale usage across respondents.</p>
      </sec>
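<p>The inter-annotator agreement reported in Section 4.1 (Cohen’s Kappa = 0.826) can be reproduced with a direct implementation of the coefficient over two annotators’ label sequences; the sequences below are invented stand-ins for the real annotations:</p>

```python
# Cohen's Kappa: observed agreement corrected for chance agreement.
from collections import Counter

def cohen_kappa(a, b):
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[lab] * cb[lab] for lab in ca) / (n * n)  # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical label sequences for the same 8 dialogue moves.
ann1 = ["R", "GI", "NS", "R", "PI", "R", "EG", "GI"]
ann2 = ["R", "GI", "NS", "R", "PI", "C", "EG", "GI"]
print(round(cohen_kappa(ann1, ann2), 3))  # 0.843
```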
    </sec>
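<p>Scoring the adapted questionnaires of Section 4.2 reduces to averaging the 5-point Likert ratings per construct for each system and comparing the means; the sketch below uses made-up ratings and construct names drawn from the paper, not the study’s actual response data:</p>

```python
# Per-construct mean Likert scores; the ratings are invented placeholders.
from statistics import mean

responses = {
    "chatgpt": {"Ease of Use": [5, 4, 5, 4], "Perceived Usefulness": [4, 4, 4, 4]},
    "gemini": {"Ease of Use": [4, 4, 5, 4], "Perceived Usefulness": [5, 4, 4, 4]},
}

means = {
    system: {construct: round(mean(r), 3) for construct, r in constructs.items()}
    for system, constructs in responses.items()
}
print(means["chatgpt"]["Ease of Use"])  # 4.5
```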
    <sec id="sec-2">
      <title>5. Results</title>
      <sec id="sec-2-1">
        <title>5.1. Conversational Pattern Analysis</title>
        <p>Once the annotation phase was completed, we performed an analysis of the distribution of dialogue moves across 20 dialogue turns to compare Gemini and GPT persuasive strategies (Figure 1). The analysis reveals clear strategic differences between the two LLM-based chatbots, ChatGPT and Gemini, in their approach to persuading users to watch a movie. Both models exhibit a dominant reliance on the Recommendation (R) strategy, with ChatGPT tending to delay the exploitation phase to give room to information gathering, while in Gemini we also find R as the primary move, along with preference collection.</p>
        <p>This shared pattern suggests a common persuasive architecture in which the models delay direct recommendations until initial rapport and exploration phases, consistent with human-like persuasive communication (see Di Bratto et al. [30] for the analysis of human recommender strategies).</p>
        <p>However, notable divergences emerge in the deployment of other strategies. ChatGPT adopts a broader and more diversified strategy set in the early turns. It frequently uses Genre Inquiry (GI), Plot Inquiry (PI), Preference Confirmation (PC) and Credibility (C) in the initial stages (Turns 3–4), indicating a deliberate effort to build social rapport and create a sense of trust by providing credible domain information and increasing its perception as a domain expert. This emotionally grounded approach is further supported by ChatGPT’s usage of Encouragement (EG), which enriches the persuasive context by portraying the bot as a cooperative and relatable interlocutor.</p>
        <p>In contrast, Gemini shows a more focused and functional strategy for the exploration phase, which seems wider (it ends at turn 5). Here, the Recommendation move (R) is accompanied by domain-specific inquiries such as Opinion Inquiry (OI) followed by Genre Inquiry (GI). This indicates a deepening of strategies in investigating user preferences to obtain more accurate information. The exploitation phase, on the other hand, presents rapport-building strategies such as Self-Modeling (SM) and Encouragement (EG). Here, the broader tactical spectrum suggests a design that intertwines personalisation with persuasion rather than staging them sequentially.</p>
        <p>Figure 1: Comparison of dialogue move distributions for Gemini (top) and ChatGPT (bottom), showing differences in communicative strategy usage.</p>
        <p>Finally, the occurrence of No Strategy (NS) moves remains low for both models, even if ChatGPT seems to use them more at the beginning of the conversation.</p>
        <p>In summary, ChatGPT demonstrates a human-centered persuasive style, combining effective strategies to foster user alignment before making recommendations. Gemini, by contrast, exhibits a more direct and utilitarian persuasion model, emphasizing information delivery and content relevance over emotional alignment. These findings underscore the importance of strategic variation in LLM-based recommendation systems and suggest differing design priorities: ChatGPT appears optimized for engagement and trust-building, while Gemini emphasizes efficiency and relevance.</p>
      </sec>
      <sec id="sec-2-2">
        <title>5.2. Questionnaires results</title>
        <p>Godspeed Questionnaire Items (Table 4, excerpt). GODSPEED I: Anthropomorphism: 1. The chatbot seems natural. 2. The chatbot seems human-like. 3. The chatbot seems conscious. GODSPEED III: Likeability: 1. The chatbot is friendly. 2. The chatbot is kind. 3. The chatbot is nice.</p>
        <p>The comparative analysis between Gemini and ChatGPT in the context of movie recommendation and perceived anthropomorphism highlights notable differences in user perception and interaction quality. As shown in Figure 2, Gemini and ChatGPT were rated similarly in the dimension of naturalness, with Gemini receiving slightly higher scores compared to ChatGPT, which also received 2 and 3 evaluations. However, the difference is small, suggesting that both systems are perceived as moderately natural, with no clear advantage. In terms of perceived humanness, Gemini again scores higher than ChatGPT, which has a more compressed boxplot leaning toward machine-like behaviour, indicating that participants tended to view Gemini as more human-like in its outputs. This difference is the largest among the three considered anthropomorphism-related dimensions and it may reflect variations in argumentative strategies given the broader tactical spectrum employed by Gemini. Conversely, on the awareness dimension, ChatGPT slightly outperforms Gemini, suggesting that users may attribute a marginally higher sense of intentionality or contextual sensitivity to ChatGPT. Moving on to Godspeed III, both systems received high ratings on the friendliness dimension, with comparable medians, as the horizontal lines in the boxes are nearly aligned. Both models have multiple outliers on the low end, i.e. data points that lie significantly outside the range of most other values in the dataset. This suggests that a few respondents rated both ChatGPT and Gemini very low in friendliness. Gemini also outperformed ChatGPT on kindness, as ChatGPT shows extreme low values, indicating that Gemini was perceived as marginally more courteous. The largest gap in the Likeability subset emerges in the pleasantness dimension: Gemini has a more centered distribution, while ChatGPT shows more variability and more extreme negative cases. This difference may suggest that Gemini evokes a more consistently positive emotional reaction among users, potentially linked to its conversational tone or affective cues. In terms of competence, Gemini again received slightly higher ratings than ChatGPT, showing less dispersion and suggesting that users viewed Gemini as marginally more capable in fulfilling its role as a conversational agent. A similar trend is observed in the knowledgeability dimension, where Gemini frequently receives high scores, with few extremes. Although the difference is modest, it may imply that Gemini is perceived as slightly more informative or better grounded in its responses. Finally, both systems performed well on the responsible dimension, with Gemini showing few outliers. These scores indicate that users generally found both systems to be reasonable and contextually appropriate in their responses. Overall, the ratings across these dimensions suggest that both systems are perceived as intelligent, with a slight and consistent advantage for Gemini in terms of perceived cognitive abilities.</p>
        <p>Figure 4: Participants’ ratings on the interaction with ChatGPT regarding recommendation quality.</p>
      </sec>
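<p>The per-turn move-distribution comparison described in Section 5.1 amounts to counting annotated strategy labels by turn index across dialogues; the sketch below runs on a toy annotation table, not the released dataset:</p>

```python
# Count chatbot strategy labels per dialogue turn across annotated
# dialogues; the tuples below are toy data, not the released dataset.
from collections import Counter, defaultdict

# (dialogue_id, turn_number, author, label)
moves = [
    (1, 1, "user", "null"), (1, 2, "chatgpt", "GI"),
    (1, 3, "chatgpt", "PI"), (1, 4, "chatgpt", "R"),
    (2, 2, "chatgpt", "GI"), (2, 4, "chatgpt", "R"),
]

per_turn = defaultdict(Counter)
for _dialogue, turn, author, label in moves:
    if author == "chatgpt":
        per_turn[turn][label] += 1

print(dict(per_turn[2]))  # {'GI': 2}
```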
      <p>Analyzing the quality of the recommendations (Figure 3 and Figure 4), in terms of Recommendation Accuracy, Gemini exhibits greater variability in the ratings, suggesting that users perceived a better alignment between their preferences and the suggestions provided by ChatGPT. However, Gemini outperformed ChatGPT in recommending novel films, which may indicate a stronger ability to diversify recommendations and introduce lesser-known content. Both chatbots were rated equally in terms of visual interface, indicating that the design did not significantly influence user preference in this area. When it comes to Explanation, Gemini stood out more clearly: it received a higher score for explaining why specific films were recommended, and also slightly outperformed ChatGPT in terms of providing sufficient information to make a viewing choice (i.e., Information Sufficiency). Interestingly, while Gemini was rated higher in terms of offering explanations, ChatGPT was perceived as clearer in making those explanations understandable (i.e., transparency), which may reflect a more accessible or user-friendly communication style. In terms of Perceived Ease of Use, ChatGPT was favored: it received higher scores for both task completion ease (Mean = 4.512 vs. 4.275) and ease of communicating preferences (Control, Mean = 4.525 vs. 4.325). This could reflect a smoother interaction flow or a greater ability to accurately interpret user input. With respect to the perceived quality of recommendations, Gemini was rated slightly higher in terms of providing good suggestions (Perceived Usefulness, Mean = 4.125 vs. 4.00). However, ChatGPT performed better in terms of Overall Satisfaction (Mean = 4.15 vs. 4.048). The difference is minimal in building user Confidence and Trust regarding the proposed choices (3.8 for Gemini vs. 3.756 for ChatGPT). Finally, looking at future Use Intentions, ChatGPT clearly outperformed Gemini: it received higher ratings for willingness to reuse the chatbot (Mean = 3.902 vs. 3.375) but not for likelihood of watching the recommended films (Behavioural Intentions, Mean = 3.658 vs. Gemini’s 3.825). Overall, the findings point to a balanced competition between the two systems. Gemini’s strengths lie in novelty and explanation, but ChatGPT is preferred for overall user experience and for encouraging continued engagement.</p>
      <p>6. Discussion &amp; conclusions</p>
      <p>These findings support the notion that users tend to evaluate a recommender primarily based on its instrumental effectiveness. Likeability factors such as kind (gentile), friendly (amichevole), and nice (simpatico) clustered together and improved the socio-emotional tone of interaction, but offered smaller gains in the perceived quality of the recommendation unless paired with a convincing recommendation rationale. In this context, ChatGPT’s early use of strategies such as preference confirmation, credibility statements, and encouragement signals an intention to build trust through a socially engaged and emotionally grounded style. The more frequent use of credibility cues in ChatGPT’s discourse likely contributed to its higher score in Transparency, as users may have perceived its explanations as clearer and more accessible due to its habit of justifying claims with trustworthy or relatable references. However, this transparency advantage may not have fully compensated for ChatGPT’s comparatively lower performance in Explanation and Recommendation Novelty, where Gemini showed a stronger profile. Gemini’s conversational architecture made heavier use of Recommendation moves (R), typically delivered through a structure of claims followed by supporting reasons. This discursive pattern may have enhanced users’ perception of the system’s explanatory power, enabling them to better understand why specific suggestions were made. Moreover, Gemini’s early deploy- … should be acknowledged to contextualize the scope of these findings. First, while the sample size is robust for a controlled experimental setup, it may still limit the generalizability of the results to broader user populations with varying backgrounds, digital literacy, or cultural expectations regarding conversational agents. Second, participants were exposed to a limited number of interactions per system, which may not fully capture the dynamic evolution of trust and satisfaction over extended use. Future studies could benefit from a longitudinal design that tracks user preferences, learning curves, and behavioral outcomes across multiple sessions. Moreover, the interpretation of constructs such as “human-like” or “competent” is inherently subjective and may vary across individuals, even when standardized scales are used. The Likert-scale approach, while effective for comparative analysis, introduces the usual constraints of self-reported measures, including social desirability bias and response centrality. Furthermore, it is important to recognize that understanding behavioral differences between chatbots is inherently limited by their black-box nature: system prompts, fine-tuning strategies, and training data are typically undisclosed. While such differences might stem from prompt design or fine-tuning, they could also result from user behavior, as different dialogic strategies,
ment of a deepening strategy (marked by domain-specific questioning styles, or interactional cues may influence
inquiries such as Opinion Inquiry and Genre Inquiry) al- the model’s responses. In sum, the current findings
oflowed it to gather more precise information about user fer meaningful evidence on how users perceive
compepreferences before initiating recommendations and its tence, warmth, and recommendation quality across two
more outcome-oriented conversational strategy appears state-of-the-art systems, but they should be viewed as
to align with its stronger performance on Behavioural a foundation for further research rather than definitive
Intention measures (i.e. users’ reported likelihood of conclusions. Larger and more diverse samples,
longituwatching the recommended films). The system’s focus dinal protocols, and richer qualitative analyses will be
on precision and justification may have reinforced users’ essential to deepen our understanding of how human-AI
sense of efectiveness and goal-orientation, enhancing interaction unfolds in recommendation contexts.
the perceived utility of the exchange. Conversely,
ChatGPT received higher ratings for Overall Satisfaction and
Future Use Intention. This may be partially attributed 7. Acknowledgments
to its broader engagement strategy, which incorporates
multiple rapport-building elements from the early stages This work is supported by the European Union - Next
of the conversation, contributing to a smoother and more Generation EU under the Italian National Recovery and
socially fulfilling experience. Furthermore, ChatGPT’s Re- silience Plan (NRRP), Mission 4, Component 2,
Ingreater popularity and widespread familiarity likely bol- vestment 1.3, CUP E83C22004640001, partnership on
ster its trustworthiness in users’ eyes. Familiarity breeds “Telecommuni- cations of the Future” (PE00000001 -
proconfidence, and this reputational advantage may have gram “RESTART”). Valeria Mauro’s work is framed in
translated into more favorable subjective evaluations, the context of the industrial internship of PNRR - D.M.
even when objective recommendation quality was com- 118/2023, Inv. 4.1 Public Administration.
parable or slightly lower. Taken together, the data
indicate that while both systems ofer valuable features, their References
strengths lie in diferent areas. Gemini excels in
functional efectiveness, providing novel and well-justified [1] C. Bosco, E. Ježek, M. Polignano, M. Sanguinetti,
recommendations, whereas ChatGPT leads in accessibil- Preface to the Eleventh Italian Conference on
Comity, emotional engagement, and trust, likely amplified by putational Linguistics (CLiC-it 2025), in:
Proceedits widespread cultural recognition. Several limitations
ings of the Eleventh Italian Conference on Compu- preprint arXiv:2406.02377 (2024).</p>
      <p>tational Linguistics (CLiC-it 2025), 2025. [14] H. Prakken, Historical overview of formal
argu[2] T. Yang, L. Chen, Unleashing the retrieval potential mentation, in: Handbook of formal argumentation,
of large language models in conversational recom- College Publications, 2018, pp. 73–141.
mender systems, in: Proceedings of the 18th ACM [15] D. Walton, E. C. Krabbe, Commitment in dialogue:
Conference on Recommender Systems, 2024, pp. Basic concepts of interpersonal reasoning, SUNY
43–52. press, 1995.
[3] L. Friedman, S. Ahuja, D. Allen, Z. Tan, H. Sidahmed, [16] E. Black, N. Maudet, S. Parsons,
ArgumentationC. Long, J. Xie, G. Schubiner, A. Patel, H. Lara, based dialogue, in: Handbook of Formal
Argumenet al., Leveraging large language models in con- tation, Volume 2, College Publications, 2021, p. 511.
versational recommender systems, arXiv preprint [17] C. Gao, W. Lei, X. He, M. de Rijke, T.-S. Chua,
arXiv:2305.07961 (2023). Advances and challenges in conversational
recom[4] Y. Deldjoo, J. Mcauley, S. Sanner, P. Castells, mender systems: A survey, AI Open 2 (2021) 100–
E. Palumbo, S. Zhang, The 1st international work- 126.
shop on risks, opportunities, and evaluation of gen- [18] F. Macagno, S. Bigi, Analyzing the pragmatic
strucerative models in recommendation (roegen), 2024. ture of dialogues, Discourse Studies 19 (2017) 148–
doi:10.1145/3640457.3687112. 168.
[5] Z. Wang, Z. Chu, T. V. Doan, S. Ni, M. Yang, [19] F. Macagno, S. Bigi, Analyzing dialogue moves
W. Zhang, History, development, and principles of in chronic care communication: Dialogical
intenlarge language models: an introductory survey, AI tions and customization of recommendations for
and Ethics 5 (2025) 1955–1971. the assessment of medical deliberation, Journal of
[6] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, Argumentation in Context 9 (2020) 167–198.</p>
      <p>
        H. Chen, X. Yi, C. Wang, Y. Wang, et al., A sur- [20] D. Walton, How the context of dialogue of an
arvey on evaluation of large language models, ACM gument influences its evaluation, Informal Logic a
transactions on intelligent systems and technology Canadian approach to Argument (2019) 196–233.
15 (2024) 1–45. [21] D. Walton, Burden of proof in deliberation dialogs,
[7] N. Epley, A. Waytz, J. T. Cacioppo, On seeing hu- in: Argumentation in Multi-Agent Systems: 6th
man: a three-factor theory of anthropomorphism., International Workshop, ArgMAS 2009, Budapest,
Psychological review 114 (2007) 864. Hungary, May 12, 2009. Revised Selected and
In[8] A. P. Chaves, M. A. Gerosa, How should my chat- vited Papers 6, Springer, 2010, pp. 1–22.
bot interact? a survey on social characteristics in [22] F. Castagna, N. Kökciyan, I. Sassoon, S. Parsons,
human–chatbot interaction design, International E. Sklar, Computational argumentation-based
chatJournal of Human–Computer Interaction 37 (2021) bots: a survey, Journal of Artificial Intelligence
729–758. Research 80 (2024) 1271–1310.
[9] A. Zhang, Y. Chen, L. Sheng, X. Wang, T.-S. Chua, [23] M. Di Bratto, A. Origlia, M. Di Maro, S. Mennella,
On generative agents in recommendation, in: Pro- Linguistics-based dialogue simulations to evaluate
ceedings of the 47th international ACM SIGIR con- argumentative conversational recommender
sysference on research and development in Informa- tems, User Modeling and User-Adapted Interaction
tion Retrieval, 2024, pp. 1807–1817. (2024) 1–31.
[
        <xref ref-type="bibr" rid="ref5">10</xref>
        ] A. Kantharuban, J. Milbauer, E. Strubell, G. Neu- [24] P. Pu, L. Chen, R. Hu, A user-centric evaluation
big, Stereotype or personalization? user identity framework for recommender systems, in:
Proceedbiases chatbot recommendations, arXiv preprint ings of the fifth ACM conference on Recommender
arXiv:2410.05613 (2024). systems, 2011, pp. 157–164.
[11] Í. Silva, L. Marinho, A. Said, M. C. Willemsen, Lever- [25] C. Bartneck, D. Kulić, E. Croft, S. Zoghbi,
Meaaging chatgpt for automated human-centered expla- surement instruments for the anthropomorphism,
nations in recommender systems, in: Proceedings animacy, likeability, perceived intelligence, and
perof the 29th International Conference on Intelligent ceived safety of robots, International journal of
User Interfaces, 2024, pp. 597–608. social robotics 1 (2009) 71–81.
[
        <xref ref-type="bibr" rid="ref1">12</xref>
        ] R. Sun, X. Li, A. Akella, J. A. Konstan, Large [26] C. Bartneck, Godspeed questionnaire series:
Translanguage models as conversational movie rec- lations and usage, in: International handbook of
ommenders: A user study, arXiv preprint behavioral health assessment, Springer, 2023, pp.
arXiv:2404.19093 (2024). 1–35.
[
        <xref ref-type="bibr" rid="ref6">13</xref>
        ] Q. Ma, X. Ren, C. Huang, Xrec: Large language [27] B. J. Grosz, C. L. Sidner, Attention, intentions, and
models for explainable recommendation, arXiv the structure of discourse, Computational
linguis
      </p>
      <p>Declaration on Generative AI</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>tics 12</source>
          (
          <year>1986</year>
          )
          <fpage>175</fpage>
          -
          <lpage>204</lpage>
          . [28]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hayati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          , Inspired:
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          arXiv preprint arXiv:
          <year>2009</year>
          .
          <volume>14306</volume>
          (
          <year>2020</year>
          ). [29]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bermejo-Luque</surname>
          </string-name>
          ,
          <article-title>The linguistic-normative model</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>of argumentation, Cogency</source>
          <volume>9</volume>
          (
          <year>2017</year>
          )
          <fpage>7</fpage>
          -
          <lpage>30</lpage>
          . [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Di Bratto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Orrico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Budeanu</surname>
          </string-name>
          , M. Mafia,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>in Recommendation Dialogues</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>127</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>doi:10</source>
          .4000/books.aaccademia.
          <volume>10564</volume>
          . [31]
          <string-name>
            <surname>A. D. Kaplan</surname>
            ,
            <given-names>T. L.</given-names>
          </string-name>
          <string-name>
            <surname>Sanders</surname>
            ,
            <given-names>P. A.</given-names>
          </string-name>
          <string-name>
            <surname>Hancock</surname>
          </string-name>
          , Likert
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>Robotics</source>
          <volume>13</volume>
          (
          <year>2021</year>
          )
          <fpage>1553</fpage>
          -
          <lpage>1562</lpage>
          . [32]
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Stangl</surname>
          </string-name>
          , Encyclopedia of statistics in behavioral
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>science</surname>
          </string-name>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>