<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Strategic Conversations: LLMs Argumentation and User Perception in Movie Recommendation Dialogues</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valeria Mauro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martina Di Bratto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valentina Russo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Azzurra Mancini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Grazioso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Logogramma S.r.l.</institution>
          ,
          <addr-line>Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Catania</institution>
          ,
          <addr-line>Catania</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This study investigates the persuasive and argumentative behaviors of two LLM-based chatbots, ChatGPT and Gemini, within the context of movie recommendation dialogues. Drawing on insights from argumentation-based dialogue and anthropomorphism research, we introduce a fine-grained annotation scheme to analyze chatbot strategies across dialogue phases. Through both linguistic analysis and user evaluation via ResQue and Godspeed questionnaires, we assess the systems' recommendation quality, perceived human-likeness, and strategic variation. Our findings reveal distinct conversational patterns: ChatGPT emphasizes affective engagement and trust-building, while Gemini adopts a more direct and efficiency-driven approach. These strategic differences are also reflected in recommendation quality and user perception. Gemini excels in recommendation quality and explanations, while ChatGPT performs better in emotional engagement, transparency, and user satisfaction.</p>
      </abstract>
      <kwd-group>
        <kwd>Argumentation-based dialogue</kwd>
        <kwd>Conversational Recommender Systems</kwd>
        <kwd>Anthropomorphism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        … evaluation protocol. Section 5 reports the results of our conversational pattern analysis and the questionnaire-based user study. In Section 6, we discuss our findings and possible future works.
      </p>
      <p>
        CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy [1]. * Corresponding author. † These authors contributed equally. valeria.mauro@phd.unict.it (V. Mauro); mdibratto@logogramma.com (M. Di Bratto); vrusso@logogramma.com (V. Russo); amancini@logogramma.com (A. Mancini); mgrazioso@logogramma.com (M. Grazioso). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
      </p>
      <sec id="sec-1a">
        <title>2. Large Language Models and Anthropomorphism</title>
        <p>Chatbots like ChatGPT and Gemini are powered by Large Language Models (LLMs), which are trained on vast amounts of textual data to learn the recurrent structures and patterns of human language [5, 6]. Mediated by language but implying something beyond it, the social capabilities of LLM-based systems enable them to simulate a range of human behaviors, thereby reinforcing users’ perceptions of them as human-like. Anthropomorphism, indeed, refers to the tendency to attribute human characteristics, behaviors, motivations, intentions, or emotions to non-human entities. It is a cognitive process that often leads people to perceive such systems as more human-like than they actually are. This tendency arises for several reasons. On a broad level, anthropomorphism is a natural and often automatic human response, driven by subtle cues in the system’s interface. It functions as a kind of cognitive shortcut: when users lack complete information about a non-human agent, they instinctively project human-like qualities onto it, drawing from readily accessible anthropocentric knowledge, i.e., knowledge about themselves or about humans in general [7]. The medium of interaction itself (a dialogue system) makes a degree of anthropomorphism almost inevitable. Language-based interaction, turn-taking, and the adoption of roles typically played by humans are all fundamental triggers for anthropomorphic attributions. These are further reinforced when chatbots are given human-like personas, names, or presumed preferences [8]. Certain linguistic strategies amplify this effect. For instance, during recommendation dialogues systems often use expressions that suggest uniquely human experiences (such as claiming to have “watched” a movie) or employ first-person pronouns (“I”, “me”, “my” when expressing opinions about the previously mentioned item), which reinforces the illusion of human agency and subjectivity. LLMs can also engage in interactive explanations, respond to user feedback, and even emulate emotional responses and social cues [9]. These abilities are particularly significant in recommendation scenarios, where personalization is key to user satisfaction. Systems like ChatGPT and Gemini can tailor their responses to individual profiles, adapting to user preferences and communicative styles over time [<xref ref-type="bibr" rid="ref5">10</xref>]. They can offer context-sensitive recommendations and justifications, which are especially valuable when users are unfamiliar with the items being suggested [11].</p>
        <p>Recent research highlights that these chatbots are not only capable of dynamically adapting their suggestions based on user behavior [<xref ref-type="bibr" rid="ref1">12</xref>], but also of providing clear and meaningful rationales for their decisions. This contributes to perceived transparency, an important factor in fostering trust and understanding in human-AI interaction [<xref ref-type="bibr" rid="ref6">13</xref>]. Moreover, LLMs demonstrate the ability to monitor and reflect on user satisfaction, recognize behavioral patterns across interactions, and adjust their recommendations accordingly [9]. This continuous adaptation and reflective capacity make LLM-based chatbots increasingly effective as customized, socially aware recommenders, simultaneously blurring the line between tool and social agent in the eyes of the user.</p>
      </sec>
      <sec id="sec-1b">
        <title>3. Argumentation-Based Recommender Dialogue Systems</title>
        <p>Conversational Recommender Systems (CoRS) have attracted considerable interest in recent years and are now a common feature of our everyday interactions with technology. They are built to enable smooth communication between people and machines, helping users perform tasks such as finding information and getting recommendations. A key aspect of dialogue systems in general is the use of argumentation, which plays an important role in their functionality [14].</p>
        <p>Argumentation-based dialogue (ABD) deals with phenomena depending on the dynamic exchange of information, which can vary according to turns and participants. ABD studies often build on Walton and Krabbe’s dialogue classification framework [15], which considers participants’ knowledge, their goals, and the rules guiding the conversation [16]. They define six dialogue categories, including Information Seeking, Persuasion, Deliberation, Negotiation, and Eristic. Identifying the dialogue type is especially helpful in analyzing effective dialogue moves to achieve communication goals, particularly in human-machine interactions. We chose the recommendation task since it is well-suited for evaluating the argumentation process in a human-machine interaction, thanks to its inherently dialogical nature and clear objective. It typically follows a two-phase structure, Exploration and Exploitation (E&amp;E). In the exploration phase, the system seeks new information, while in the exploitation phase, it leverages the most promising known option [17].</p>
        <p>The Exploration phase can be associated with Walton’s Information Seeking dialogue, or more specifically, the Information Sharing type, as in real dialogues the situation of lacking knowledge is often dynamic rather than static [18, 19]. The Exploitation phase, on the other hand, aligns with the deliberation dialogue, a cooperative form of interaction in which participants work together to find a solution to a shared problem while considering everyone’s interests [20]. In this context, argumentation plays a key role in proposing solutions, supporting them with reasons, and evaluating alternatives [21], all essential features for CoRS. This is especially relevant today with the advent of LLMs: integrating computational argumentation formalisms could help address challenges such as the lack of explainability, transparency, and governability [22, 23], thus maintaining a trustworthy perception among users. The aim of this work is to investigate the behavior of LLM-based chatbots in recommendation scenarios, evaluating differences and similarities in their argumentation strategies, and assessing, through human evaluation, the quality of the recommendations and the perceived anthropomorphism, as well as whether these aspects correlate with the identified argumentation strategies.</p>
      </sec>
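<p>To make the two-phase E&amp;E structure described above concrete, the following minimal sketch shows an exploration loop (preference elicitation) followed by an exploitation step (a recommendation claim backed by reasons); every function name, slot, and rule here is our own illustrative invention, not an actual system implementation:</p>

```python
# Illustrative sketch of the Exploration and Exploitation (E-and-E) structure:
# first elicit preferences (exploration), then back a recommendation claim
# with supporting reasons (exploitation). All names here are hypothetical.

def recommend_dialogue(ask_user):
    prefs = {}
    # Exploration phase: the system seeks new information via preference
    # elicitation moves (e.g., Genre Inquiry, Plot Inquiry).
    for slot, question in [
        ("genre", "Which genres do you prefer?"),
        ("plot", "Any preferred themes or tone?"),
    ]:
        prefs[slot] = ask_user(question)
    # Exploitation phase: the system leverages the most promising known
    # option, pairing the final claim with supporting reasons.
    claim = "I recommend a {} movie.".format(prefs["genre"])
    reasons = ["It matches your interest in {}.".format(v) for v in prefs.values()]
    return claim, reasons

# Scripted "user" answers for demonstration.
answers = iter(["sci-fi action", "post-apocalyptic settings"])
claim, reasons = recommend_dialogue(lambda question: next(answers))
print(claim)  # I recommend a sci-fi action movie.
```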
      <sec id="sec-1-1">
        <title>4. Data collection &amp; methodology</title>
        <p>In this study, we decided to evaluate two LLM-based chatbots in the movie recommendation domain: Gemini and ChatGPT. More specifically, our objective was to evaluate the systems’ performance as recommenders and, more broadly, as human-passing interlocutors through user ratings. Participants assessed both the quality of the recommendations and their perceptions of anthropomorphism, likeability, and intelligence. A between-subjects design was chosen to avoid carryover effects and to reduce the cognitive load and fatigue associated with completing the same questionnaire twice, which is common in within-subjects designs. Participants were mainly recruited from the BA and MA programs of the Department of Humanities at the University of Catania. The most represented age group is that of participants under 30, accounting for 87.8% of those who took part in the ChatGPT test and 92.5% of those in the Gemini test. The survey was administered via Google Forms, and data collection took place over approximately one month, from early February to mid-March 2025. A total of 95 participants took part in the study, resulting in 81 conversations correctly submitted via the designated input box, comprising 2,362 dialogue turns overall (dataset: https://github.com/marcograzioso/human-bot-recommendationdialogues-it). The study procedure followed these steps: participants read a brief introductory statement outlining the task (i.e., prompting a film recommendation from ChatGPT or Gemini in a casual, conversational style). They were also informed that additional instructions would follow and that they would be asked to submit an anonymous link to their conversation. In order to proceed, participants were required to check two consent boxes on the same page. Participants were then presented with a detailed set of instructions on how to use ChatGPT or Gemini and how to share their conversations. Users were free to interact with the bots without any conversational constraints. After completing the task using the assigned system, they submitted the link to their chat in the designated field. A demographic survey followed, collecting information on gender, age, education level, and prior experience with the chatbot. Finally, participants completed the adapted ResQue [24] and Godspeed questionnaires [25, 26]: the former to evaluate the quality of the recommendation, the latter for perceived anthropomorphism.</p>
        <sec id="sec-1-1a">
          <title>4.1. Dialogue annotation scheme</title>
          <p>The annotation scheme builds on the existing literature while introducing novel extensions. The units of analysis are dialogical moves, clusters of words or dialogue segments expressing a communicative intention [18, 27]. A move typically corresponds to a single dialogical turn, though a turn may employ multiple strategies to pursue subgoals. We deployed a set of categories for the recommender’s and the seeker’s utterances; the annotation scheme encompasses eighteen and nineteen categories, respectively. The category annotation scheme is twofold. To account for the recommender’s strategies (i.e., the chatbot’s), twelve strategies were initially selected from Hayati et al. [28], who defined this tagset in the context of human-human interaction. The first eight are sociable strategies aimed at building rapport with the seeker: Personal Opinion (PO), used by the recommender to share subjective views about a movie, such as opinions on the plot, actors, or other elements; Personal Experience (PE), used by the recommender to share personal experiences related to a movie (e.g., mentioning they’ve watched it several times) in order to persuade the seeker; Similarity (S), used to express empathy and alignment with the seeker’s preferences, creating a sense of like-mindedness and building trust; Encouragement (EG), used to praise the seeker’s taste and encourage them to watch the recommended movie; Offering Help (OH), used by the recommender to explicitly express an intention to help the seeker or to be transparent about their recommendations; Preference Confirmation (PC), used by the recommender to ask about or rephrase the seeker’s preferences, making their reasoning process explicit; Credibility (C), used by the recommender to display expertise or trustworthiness by providing factual information about the movie (e.g., plot, cast, or awards); and Self-Modeling (SM), used by the recommender to present themselves as a role model, for example by watching the movie first to encourage the seeker to do the same. Two additional categories cover preference elicitation: Experience Inquiry (EI), used by the recommender to ask about the seeker’s past movie-watching experiences, such as whether they have seen a specific movie; and Opinion Inquiry (OI), used to ask for the seeker’s opinion on specific movie-related attributes, such as their thoughts on the plot or the actors’ performances. Two functional labels are also included: Recommendation (R) and No Strategy (NS). The former (R) is intended as the final claim in the argumentation process, specifically a communicative act aimed at justifying a target claim [29]. The latter (NS) is used for phatic or neutral moves, such as greetings or backchanneling. Given the versatility of modern conversational AI systems like ChatGPT and Gemini, fully capable of posing technical questions across domains, we introduced six further categories to capture a broader range of preference elicitation strategies:</p>
          <p>• Streaming Service Inquiry (SSI): the recommender asks about the seeker’s (i.e. the user’s) preferred streaming platforms;
• Genre Inquiry (GI): the recommender asks about the seeker’s preferred genres;
• Actor Inquiry (AcI): the recommender asks about favorite actors;
• Director Inquiry (DI): the recommender asks about favorite directors;
• Plot Inquiry (PI): the recommender asks about preferred narrative or thematic features;
• Action Inquiry (AI): the recommender prompts the user regarding the next step in the conversation.</p>
          <p>The last two categories require further clarification. Since a movie inevitably involves a wide array of features that cannot be fully captured by any single fine-grained strategy, Plot Inquiry (PI) was defined broadly. It includes questions not only about narrative content but also about a film’s perceived tone (e.g., “pure fun” vs. “deep”), cultural status (e.g., “cult classic”), or recency. Action Inquiry (AI), instead, accounts for the fact that even domain-restricted dialogues can drift in topic. This label is assigned when the chatbot explicitly asks about the user’s intended course of action (for instance, “What would you like to do now?”), a strong signal that the system is adapting to dynamic user needs, which may evolve during the conversation. All the sociable strategies used to establish the conversation are reported in Table 1.</p>
          <p>Table 1: Recommender strategy tagset. Sociable strategies: Personal Opinion, Personal Experience, Similarity, Encouragement, Offering Help, Preference Confirmation, Credibility, Self-Modeling. Preference elicitation: Experience Inquiry, Opinion Inquiry, Streaming Service Inquiry, Genre Inquiry, Actor Inquiry, Director Inquiry, Plot Inquiry, Action Inquiry. Functional labels: Recommendation, No Strategy.</p>
          <p>To annotate the seeker’s strategies, eleven strategies grouped into four categories were initially adopted from Di Bratto et al. [30]. However, the scope of this work is centered on analyzing the behavior of LLM-based chatbots in engaging conversations using argumentative strategies. Therefore, the analysis of seeker utterances has not been addressed. Table 2 reports a sample of annotated dialogues. Each row includes the dialogue ID (i.e., the number of the conversation), the turn number (counted from the beginning of the dialogue), the author (either the user or the chatbot), the dialogic move under analysis, and its corresponding label. A single turn may contain multiple dialogic moves, each annotated separately.</p>
          <p>Table 2: Sample of an annotated dialogue (author, text, category):
user: “Quale film mi consiglieresti?” (‘Which film would you recommend me?’), [null];
chatgpt: “Dipende da cosa ti piace!” (‘That depends on what you like!’), OI;
chatgpt: “Preferisci azione, thriller, fantascienza, horror, dramma o commedia?” (‘Do you prefer action, thriller, science fiction, horror, drama, or comedy?’), GI;
chatgpt: “Oppure cerchi qualcosa di più specifico, tipo un film recente o un cult imperdibile?” (‘Or are you looking for something more specific, like a recent film or an unmissable cult classic?’), PI;
user: “Fantascienza azione” (‘Science fiction, action’), [null];
chatgpt: “Ottima scelta!” (‘Great choice!’), EG;
chatgpt: “Ecco alcuni film di fantascienza d’azione che potresti apprezzare: "Mad Max: Fury Road" (2015) – Un’esplosione di adrenalina in un mondo post-apocalittico con inseguimenti folli e azione non-stop.” (‘Here are some action science-fiction films you might enjoy: "Mad Max: Fury Road" (2015) – an explosion of adrenaline in a post-apocalyptic world with wild chases and non-stop action.’), R, C.</p>
          <p>To evaluate annotation quality, a second annotator with linguistic background independently annotated 15% of the total dialogue moves in the dataset. Inter-annotator agreement was then calculated using Cohen’s Kappa, resulting in a score of 0.826, which indicates a high level of agreement between the two annotators.</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>4.2. User evaluation questionnaires</title>
        <p>The evaluation constructs were adapted and translated into Italian from two well-established models: the ResQue questionnaire [24] and the Godspeed questionnaires [25]. Together, these provide a robust, user-centered evaluation framework. The final questionnaire consisted of 22 items corresponding to 16 constructs. All items were rated on a 5-point Likert scale. ResQue offers a concise yet powerful tool for assessing users’ perceptions, beliefs, attitudes, and acceptance of a recommender system. Due to the study’s scope and time constraints, we adopted the “short version” of ResQue, using one item per construct. Two constructs (Recommendation Diversity and Interaction Adequacy) were excluded. The final ResQue-based questionnaire included 13 constructs and items. All original labels were preserved, except for Purchase Intention, which was renamed Behavioral Intention to better reflect the study’s focus (Table 3).</p>
        <p>From the Godspeed model, we selected three of the five original questionnaires: Anthropomorphism (Godspeed I), Likeability (Godspeed III), and Perceived Intelligence (Godspeed IV) (Table 4). Two items were removed from each construct to streamline the questionnaire. Minimum coverage of the constructs’ theoretical domains is guaranteed, as the items from each questionnaire are interrelated. To ensure clarity and consistency, all Godspeed semantic differential scales were adapted to Likert-type items. This choice is supported by [31], who argue that Likert scales may improve response accuracy. Moreover, given that ChatGPT and Gemini are disembodied agents, we either omitted or carefully rephrased terms that refer to physical appearance in order to avoid ambiguity in the Italian target language. For instance, the expression “human-like”, typically rendered in existing Italian translations as “dall’aspetto umano” (‘with a human appearance’) [26], was considered potentially misleading when applied to text-based agents. Instead, we adapted the wording to better fit the nature of the evaluated systems and, for the same reason, chose to exclude Animacy (Godspeed II) and Perceived Safety (Godspeed V) from our evaluation. For future analysis, it would be useful to adopt Item Response Theory (IRT)-based models [32]. These models offer a principled way to address individual variability in Likert scale use by modeling latent traits while accounting for person- and item-specific influences. Moreover, advanced IRT extensions such as multidimensional and mixture models provide additional flexibility to handle systematic response biases. We believe this methodological choice would strengthen the validity and fairness of our analysis and reduce bias due to differential scale usage across respondents.</p>
      </sec>
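<p>The inter-annotator agreement reported in Section 4.1 (Cohen’s Kappa = 0.826) can be reproduced with a direct implementation of the coefficient over two annotators’ label sequences; the sequences below are invented stand-ins for the real annotations:</p>

```python
# Cohen's Kappa: observed agreement corrected for chance agreement.
from collections import Counter

def cohen_kappa(a, b):
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[lab] * cb[lab] for lab in ca) / (n * n)  # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical label sequences for the same 8 dialogue moves.
ann1 = ["R", "GI", "NS", "R", "PI", "R", "EG", "GI"]
ann2 = ["R", "GI", "NS", "R", "PI", "C", "EG", "GI"]
print(round(cohen_kappa(ann1, ann2), 3))  # 0.843
```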
    </sec>
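<p>Scoring the adapted questionnaires of Section 4.2 reduces to averaging the 5-point Likert ratings per construct for each system and comparing the means; the sketch below uses made-up ratings and construct names drawn from the paper, not the study’s actual response data:</p>

```python
# Per-construct mean Likert scores; the ratings are invented placeholders.
from statistics import mean

responses = {
    "chatgpt": {"Ease of Use": [5, 4, 5, 4], "Perceived Usefulness": [4, 4, 4, 4]},
    "gemini": {"Ease of Use": [4, 4, 5, 4], "Perceived Usefulness": [5, 4, 4, 4]},
}

means = {
    system: {construct: round(mean(r), 3) for construct, r in constructs.items()}
    for system, constructs in responses.items()
}
print(means["chatgpt"]["Ease of Use"])  # 4.5
```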
    <sec id="sec-2">
      <title>5. Results</title>
      <sec id="sec-2-1">
        <title>5.1. Conversational Pattern Analysis</title>
        <p>Once the annotation phase was completed, we performed an analysis of the distribution of dialogue moves across 20 dialogue turns to compare Gemini and GPT persuasive strategies (Figure 1). The analysis reveals clear strategic differences between the two LLM-based chatbots, ChatGPT and Gemini, in their approach to persuading users to watch a movie. Both models exhibit a dominant reliance on the Recommendation (R) strategy, with ChatGPT tending to delay the exploitation phase to give room to information gathering, while in Gemini we also find R as the primary move, along with preference collection.</p>
        <p>This shared pattern suggests a common persuasive architecture in which the models delay direct recommendations until initial rapport and exploration phases, consistent with human-like persuasive communication (see Di Bratto et al. [30] for the analysis of human recommender strategies).</p>
        <p>However, notable divergences emerge in the deployment of other strategies. ChatGPT adopts a broader and more diversified strategy set in the early turns. It frequently uses Genre Inquiry (GI), Plot Inquiry (PI), Preference Confirmation (PC) and Credibility (C) in the initial stages (Turns 3–4), indicating a deliberate effort to build social rapport and create a sense of trust by providing credible domain information and increasing its perception as a domain expert. This emotionally grounded approach is further supported by ChatGPT’s usage of Encouragement (EG), which enriches the persuasive context by portraying the bot as a cooperative and relatable interlocutor.</p>
        <p>In contrast, Gemini shows a more focused and functional strategy for the exploration phase, which seems wider (it ends at turn 5). Here, the Recommendation move (R) is accompanied by domain-specific inquiries such as Opinion Inquiry (OI) followed by Genre Inquiry (GI). This indicates a deepening of strategies in investigating user preferences to obtain more accurate information. The exploitation phase, on the other hand, presents rapport-building strategies such as Self-Modeling (SM) and Encouragement (EG). Here, the broader tactical spectrum suggests a design that intertwines personalisation with persuasion rather than staging them sequentially.</p>
        <p>Figure 1: Comparison of dialogue move distributions for Gemini (top) and ChatGPT (bottom), showing differences in communicative strategy usage.</p>
        <p>Finally, the occurrence of No Strategy (NS) moves remains low for both models, even if ChatGPT seems to use them more at the beginning of the conversation.</p>
        <p>In summary, ChatGPT demonstrates a human-centered persuasive style, combining effective strategies to foster user alignment before making recommendations. Gemini, by contrast, exhibits a more direct and utilitarian persuasion model, emphasizing information delivery and content relevance over emotional alignment. These findings underscore the importance of strategic variation in LLM-based recommendation systems and suggest differing design priorities: ChatGPT appears optimized for engagement and trust-building, while Gemini emphasizes efficiency and relevance.</p>
      </sec>
      <sec id="sec-2-2">
        <title>5.2. Questionnaires results</title>
        <p>Godspeed Questionnaire Items (Table 4, excerpt). GODSPEED I: Anthropomorphism: 1. The chatbot seems natural. 2. The chatbot seems human-like. 3. The chatbot seems conscious. GODSPEED III: Likeability: 1. The chatbot is friendly. 2. The chatbot is kind. 3. The chatbot is nice.</p>
        <p>The comparative analysis between Gemini and ChatGPT in the context of movie recommendation and perceived anthropomorphism highlights notable differences in user perception and interaction quality. As shown in Figure 2, Gemini and ChatGPT were rated similarly in the dimension of naturalness, with Gemini receiving slightly higher scores compared to ChatGPT, which also received 2 and 3 evaluations. However, the difference is small, suggesting that both systems are perceived as moderately natural, with no clear advantage. In terms of perceived humanness, Gemini again scores higher than ChatGPT, which has a more compressed boxplot leaning toward machine-like behaviour, indicating that participants tended to view Gemini as more human-like in its outputs. This difference is the largest among the three considered anthropomorphism-related dimensions and it may reflect variations in argumentative strategies given the broader tactical spectrum employed by Gemini. Conversely, on the awareness dimension, ChatGPT slightly outperforms Gemini, suggesting that users may attribute a marginally higher sense of intentionality or contextual sensitivity to ChatGPT. Moving on to Godspeed III, both systems received high ratings on the friendliness dimension, with comparable medians, as the horizontal lines in the boxes are nearly aligned. Both models have multiple outliers on the low end, i.e. data points that lie significantly outside the range of most other values in the dataset. This suggests that a few respondents rated both ChatGPT and Gemini very low in friendliness. Gemini also outperformed ChatGPT on kindness, as ChatGPT shows extreme low values, indicating that Gemini was perceived as marginally more courteous. The largest gap in the Likeability subset emerges in the pleasantness dimension: Gemini has a more centered distribution, while ChatGPT shows more variability and more extreme negative cases. This difference may suggest that Gemini evokes a more consistently positive emotional reaction among users, potentially linked to its conversational tone or affective cues. In terms of competence, Gemini again received slightly higher ratings than ChatGPT, showing less dispersion and suggesting that users viewed Gemini as marginally more capable in fulfilling its role as a conversational agent. A similar trend is observed in the knowledgeability dimension, where Gemini frequently receives high scores, with few extremes. Although the difference is modest, it may imply that Gemini is perceived as slightly more informative or better grounded in its responses. Finally, both systems performed well on the responsible dimension, with Gemini showing few outliers. These scores indicate that users generally found both systems to be reasonable and contextually appropriate in their responses. Overall, the ratings across these dimensions suggest that both systems are perceived as intelligent, with a slight and consistent advantage for Gemini in terms of perceived cognitive abilities.</p>
        <p>Figure 4: Participants’ ratings on the interaction with ChatGPT regarding recommendation quality.</p>
      </sec>
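<p>The per-turn move-distribution comparison described in Section 5.1 amounts to counting annotated strategy labels by turn index across dialogues; the sketch below runs on a toy annotation table, not the released dataset:</p>

```python
# Count chatbot strategy labels per dialogue turn across annotated
# dialogues; the tuples below are toy data, not the released dataset.
from collections import Counter, defaultdict

# (dialogue_id, turn_number, author, label)
moves = [
    (1, 1, "user", "null"), (1, 2, "chatgpt", "GI"),
    (1, 3, "chatgpt", "PI"), (1, 4, "chatgpt", "R"),
    (2, 2, "chatgpt", "GI"), (2, 4, "chatgpt", "R"),
]

per_turn = defaultdict(Counter)
for _dialogue, turn, author, label in moves:
    if author == "chatgpt":
        per_turn[turn][label] += 1

print(dict(per_turn[2]))  # {'GI': 2}
```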
      <p>Analyzing the quality of the recommendations (Figure 3 and Figure 4), in terms of Recommendation Accuracy, Gemini exhibits greater variability in the ratings, suggesting that users perceived a better alignment between their preferences and the suggestions provided by ChatGPT. However, Gemini outperformed ChatGPT in recommending novel films, which may indicate a stronger ability to diversify recommendations and introduce lesser-known content. Both chatbots were rated equally in terms of visual interface, indicating that the design did not significantly influence user preference in this area. When it comes to Explanation, Gemini stood out more clearly: it received a higher score for explaining why specific films were recommended, and also slightly outperformed ChatGPT in terms of providing sufficient information to make a viewing choice (i.e., Information Sufficiency). Interestingly, while Gemini was rated higher in terms of offering explanations, ChatGPT was perceived as clearer in making those explanations understandable (i.e., transparency), which may reflect a more accessible or user-friendly communication style. In terms of Perceived Ease of Use, ChatGPT was favored: it received higher scores for both task completion ease (Mean = 4.512 vs. 4.275) and ease of communicating preferences (Control, Mean = 4.525 vs. 4.325). This could reflect a smoother interaction flow or a greater ability to accurately interpret user input. With respect to the perceived quality of recommendations, Gemini was rated slightly higher in terms of providing good suggestions (Perceived Usefulness, Mean = 4.125 vs. 4.00). However, ChatGPT performed better in terms of Overall Satisfaction (Mean = 4.15 vs. 4.048). The difference is minimal in building user Confidence and Trust regarding the proposed choices (3.8 for Gemini vs. 3.756 for ChatGPT). Finally, looking at future Use Intentions, ChatGPT clearly outperformed Gemini: it received higher ratings for willingness to reuse the chatbot (Mean = 3.902 vs. 3.375) but not for likelihood of watching the recommended films (Behavioural Intentions, Mean = 3.658 vs. Gemini’s 3.825). Overall, the findings point to a balanced competition between the two systems. Gemini’s strengths lie in novelty and explanation, but ChatGPT is preferred for overall user experience and for encouraging continued engagement.</p>
      <p>6. Discussion &amp; conclusions</p>
      <p>These findings support the notion that users tend to evaluate a recommender primarily based on its instrumental effectiveness. Likeability factors such as kind (gentile), friendly (amichevole), and nice (simpatico) clustered together and improved the socio-emotional tone of interaction, but offered smaller gains in the perceived quality of the recommendation unless paired with a convincing recommendation rationale. In this context, ChatGPT’s early use of strategies such as preference confirmation, credibility statements, and encouragement signals an intention to build trust through a socially engaged and emotionally grounded style. The more frequent use of credibility cues in ChatGPT’s discourse likely contributed to its higher score in Transparency, as users may have perceived its explanations as clearer and more accessible due to its habit of justifying claims with trustworthy or relatable references. However, this transparency advantage may not have fully compensated for ChatGPT’s comparatively lower performance in Explanation and Recommendation Novelty, where Gemini showed a stronger profile. Gemini’s conversational architecture made heavier use of Recommendation moves (R), typically delivered through a structure of claims followed by supporting reasons. This discursive pattern may have enhanced users’ perception of the system’s explanatory power, enabling them to better understand why specific suggestions were made. Moreover, Gemini’s early deploy- … should be acknowledged to contextualize the scope of these findings. First, while the sample size is robust for a controlled experimental setup, it may still limit the generalizability of the results to broader user populations with varying backgrounds, digital literacy, or cultural expectations regarding conversational agents. Second, participants were exposed to a limited number of interactions per system, which may not fully capture the dynamic evolution of trust and satisfaction over extended use. Future studies could benefit from a longitudinal design that tracks user preferences, learning curves, and behavioral outcomes across multiple sessions. Moreover, the interpretation of constructs such as “human-like” or “competent” is inherently subjective and may vary across individuals, even when standardized scales are used. The Likert-scale approach, while effective for comparative analysis, introduces the usual constraints of self-reported measures, including social desirability bias and response centrality. Furthermore, it is important to recognize that understanding behavioral differences between chatbots is inherently limited by their black-box nature: system prompts, fine-tuning strategies, and training data are typically undisclosed. While such differences might stem from prompt design or fine-tuning, they could also result from user behavior, as different dialogic strategies,
ment of a deepening strategy (marked by domain-specific questioning styles, or interactional cues may influence
inquiries such as Opinion Inquiry and Genre Inquiry) al- the model’s responses. In sum, the current findings
oflowed it to gather more precise information about user fer meaningful evidence on how users perceive
compepreferences before initiating recommendations and its tence, warmth, and recommendation quality across two
more outcome-oriented conversational strategy appears state-of-the-art systems, but they should be viewed as
to align with its stronger performance on Behavioural a foundation for further research rather than definitive
Intention measures (i.e. users’ reported likelihood of conclusions. Larger and more diverse samples,
longituwatching the recommended films). The system’s focus dinal protocols, and richer qualitative analyses will be
on precision and justification may have reinforced users’ essential to deepen our understanding of how human-AI
sense of efectiveness and goal-orientation, enhancing interaction unfolds in recommendation contexts.
the perceived utility of the exchange. Conversely,
ChatGPT received higher ratings for Overall Satisfaction and
Future Use Intention. This may be partially attributed 7. Acknowledgments
to its broader engagement strategy, which incorporates
multiple rapport-building elements from the early stages This work is supported by the European Union - Next
of the conversation, contributing to a smoother and more Generation EU under the Italian National Recovery and
socially fulfilling experience. Furthermore, ChatGPT’s Re- silience Plan (NRRP), Mission 4, Component 2,
Ingreater popularity and widespread familiarity likely bol- vestment 1.3, CUP E83C22004640001, partnership on
ster its trustworthiness in users’ eyes. Familiarity breeds “Telecommuni- cations of the Future” (PE00000001 -
proconfidence, and this reputational advantage may have gram “RESTART”). Valeria Mauro’s work is framed in
translated into more favorable subjective evaluations, the context of the industrial internship of PNRR - D.M.
even when objective recommendation quality was com- 118/2023, Inv. 4.1 Public Administration.
parable or slightly lower. Taken together, the data
indicate that while both systems ofer valuable features, their References
strengths lie in diferent areas. Gemini excels in
functional efectiveness, providing novel and well-justified [1] C. Bosco, E. Ježek, M. Polignano, M. Sanguinetti,
recommendations, whereas ChatGPT leads in accessibil- Preface to the Eleventh Italian Conference on
Comity, emotional engagement, and trust, likely amplified by putational Linguistics (CLiC-it 2025), in:
Proceedits widespread cultural recognition. Several limitations
ings of the Eleventh Italian Conference on Compu- preprint arXiv:2406.02377 (2024).</p>
      <p>tational Linguistics (CLiC-it 2025), 2025. [14] H. Prakken, Historical overview of formal
argu[2] T. Yang, L. Chen, Unleashing the retrieval potential mentation, in: Handbook of formal argumentation,
of large language models in conversational recom- College Publications, 2018, pp. 73–141.
mender systems, in: Proceedings of the 18th ACM [15] D. Walton, E. C. Krabbe, Commitment in dialogue:
Conference on Recommender Systems, 2024, pp. Basic concepts of interpersonal reasoning, SUNY
43–52. press, 1995.
[3] L. Friedman, S. Ahuja, D. Allen, Z. Tan, H. Sidahmed, [16] E. Black, N. Maudet, S. Parsons,
ArgumentationC. Long, J. Xie, G. Schubiner, A. Patel, H. Lara, based dialogue, in: Handbook of Formal
Argumenet al., Leveraging large language models in con- tation, Volume 2, College Publications, 2021, p. 511.
versational recommender systems, arXiv preprint [17] C. Gao, W. Lei, X. He, M. de Rijke, T.-S. Chua,
arXiv:2305.07961 (2023). Advances and challenges in conversational
recom[4] Y. Deldjoo, J. Mcauley, S. Sanner, P. Castells, mender systems: A survey, AI Open 2 (2021) 100–
E. Palumbo, S. Zhang, The 1st international work- 126.
shop on risks, opportunities, and evaluation of gen- [18] F. Macagno, S. Bigi, Analyzing the pragmatic
strucerative models in recommendation (roegen), 2024. ture of dialogues, Discourse Studies 19 (2017) 148–
doi:10.1145/3640457.3687112. 168.
[5] Z. Wang, Z. Chu, T. V. Doan, S. Ni, M. Yang, [19] F. Macagno, S. Bigi, Analyzing dialogue moves
W. Zhang, History, development, and principles of in chronic care communication: Dialogical
intenlarge language models: an introductory survey, AI tions and customization of recommendations for
and Ethics 5 (2025) 1955–1971. the assessment of medical deliberation, Journal of
[6] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, Argumentation in Context 9 (2020) 167–198.</p>
      <p>
        H. Chen, X. Yi, C. Wang, Y. Wang, et al., A sur- [20] D. Walton, How the context of dialogue of an
arvey on evaluation of large language models, ACM gument influences its evaluation, Informal Logic a
transactions on intelligent systems and technology Canadian approach to Argument (2019) 196–233.
15 (2024) 1–45. [21] D. Walton, Burden of proof in deliberation dialogs,
[7] N. Epley, A. Waytz, J. T. Cacioppo, On seeing hu- in: Argumentation in Multi-Agent Systems: 6th
man: a three-factor theory of anthropomorphism., International Workshop, ArgMAS 2009, Budapest,
Psychological review 114 (2007) 864. Hungary, May 12, 2009. Revised Selected and
In[8] A. P. Chaves, M. A. Gerosa, How should my chat- vited Papers 6, Springer, 2010, pp. 1–22.
bot interact? a survey on social characteristics in [22] F. Castagna, N. Kökciyan, I. Sassoon, S. Parsons,
human–chatbot interaction design, International E. Sklar, Computational argumentation-based
chatJournal of Human–Computer Interaction 37 (2021) bots: a survey, Journal of Artificial Intelligence
729–758. Research 80 (2024) 1271–1310.
[9] A. Zhang, Y. Chen, L. Sheng, X. Wang, T.-S. Chua, [23] M. Di Bratto, A. Origlia, M. Di Maro, S. Mennella,
On generative agents in recommendation, in: Pro- Linguistics-based dialogue simulations to evaluate
ceedings of the 47th international ACM SIGIR con- argumentative conversational recommender
sysference on research and development in Informa- tems, User Modeling and User-Adapted Interaction
tion Retrieval, 2024, pp. 1807–1817. (2024) 1–31.
[
        <xref ref-type="bibr" rid="ref5">10</xref>
        ] A. Kantharuban, J. Milbauer, E. Strubell, G. Neu- [24] P. Pu, L. Chen, R. Hu, A user-centric evaluation
big, Stereotype or personalization? user identity framework for recommender systems, in:
Proceedbiases chatbot recommendations, arXiv preprint ings of the fifth ACM conference on Recommender
arXiv:2410.05613 (2024). systems, 2011, pp. 157–164.
[11] Í. Silva, L. Marinho, A. Said, M. C. Willemsen, Lever- [25] C. Bartneck, D. Kulić, E. Croft, S. Zoghbi,
Meaaging chatgpt for automated human-centered expla- surement instruments for the anthropomorphism,
nations in recommender systems, in: Proceedings animacy, likeability, perceived intelligence, and
perof the 29th International Conference on Intelligent ceived safety of robots, International journal of
User Interfaces, 2024, pp. 597–608. social robotics 1 (2009) 71–81.
[
        <xref ref-type="bibr" rid="ref1">12</xref>
        ] R. Sun, X. Li, A. Akella, J. A. Konstan, Large [26] C. Bartneck, Godspeed questionnaire series:
Translanguage models as conversational movie rec- lations and usage, in: International handbook of
ommenders: A user study, arXiv preprint behavioral health assessment, Springer, 2023, pp.
arXiv:2404.19093 (2024). 1–35.
[
        <xref ref-type="bibr" rid="ref6">13</xref>
        ] Q. Ma, X. Ren, C. Huang, Xrec: Large language [27] B. J. Grosz, C. L. Sidner, Attention, intentions, and
models for explainable recommendation, arXiv the structure of discourse, Computational
linguis
      </p>
      <p>Declaration on Generative AI</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>tics 12</source>
          (
          <year>1986</year>
          )
          <fpage>175</fpage>
          -
          <lpage>204</lpage>
          . [28]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hayati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          , Inspired:
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          arXiv preprint arXiv:
          <year>2009</year>
          .
          <volume>14306</volume>
          (
          <year>2020</year>
          ). [29]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bermejo-Luque</surname>
          </string-name>
          ,
          <article-title>The linguistic-normative model</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>of argumentation, Cogency</source>
          <volume>9</volume>
          (
          <year>2017</year>
          )
          <fpage>7</fpage>
          -
          <lpage>30</lpage>
          . [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Di Bratto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Orrico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Budeanu</surname>
          </string-name>
          , M. Mafia,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>in Recommendation Dialogues</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>127</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>doi:10</source>
          .4000/books.aaccademia.
          <volume>10564</volume>
          . [31]
          <string-name>
            <surname>A. D. Kaplan</surname>
            ,
            <given-names>T. L.</given-names>
          </string-name>
          <string-name>
            <surname>Sanders</surname>
            ,
            <given-names>P. A.</given-names>
          </string-name>
          <string-name>
            <surname>Hancock</surname>
          </string-name>
          , Likert
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>Robotics</source>
          <volume>13</volume>
          (
          <year>2021</year>
          )
          <fpage>1553</fpage>
          -
          <lpage>1562</lpage>
          . [32]
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Stangl</surname>
          </string-name>
          , Encyclopedia of statistics in behavioral
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>science</surname>
          </string-name>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>