<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Modulation via Reinforcement Learning and Prompted Language Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christian Tamantini</string-name>
          <email>christian.tamantini@cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gloria Beraldo</string-name>
          <email>gloria.beraldo@cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Umbrico</string-name>
          <email>alessandro.umbrico@cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Orlandini</string-name>
          <email>andrea.orlandini@cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Human-Robot Interaction, Reinforcement Learning, Large Language Models, Prompt Engineering</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Cognitive Sciences and Technologies, National Research Council of Italy</institution>
          ,
          <addr-line>00196 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Workshop on Social Robotics for Human-Centered Assistive and Rehabilitation AI (a Fit4MedRob event) - ICSR 2025</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the context of socially assistive robotics, there is a growing need for interaction strategies that can adapt to users' emotional states in real time, as fixed or generic communication styles often fail to sustain user engagement or meet individual motivational needs, especially in long-term human-robot interaction. To address this challenge, this paper presents a novel framework for adaptive interaction style modulation in socially assistive agents, combining large language models (LLMs) with reinforcement learning based on real-time emotion recognition. The proposed architecture leverages multimodal sensing to monitor the user's affective state and dynamically selects among predefined communicative styles using Thompson Sampling. At each interaction turn, the user's emotional feedback is converted into a scalar reward, allowing the system to reinforce styles that yield more positive affective outcomes. Style conditioning is operationalized through prompting strategies that guide the LLM to generate responses aligned with the selected tone. A preliminary evaluation using VADER sentiment analysis demonstrates that stylistic prompts successfully induce measurable differences in sentiment polarity, neutrality, and verbosity. These findings suggest the viability of our approach to style-aware dialogue generation and support its potential for long-term adaptation in personalized human-agent interaction.</p>
      </abstract>
      <kwd-group>
        <kwd>Human-Robot Interaction</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>Generation</kwd>
      </kwd-group>
      <conference>
        <conf-name>Workshop on Social Robotics for Human-Centered Assistive and Rehabilitation AI (a Fit4MedRob event) - ICSR 2025</conf-name>
      </conference>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In human-robot interaction, the quality of the communicative exchange is a crucial determinant of user
engagement, trust, and adherence over time [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. While delivering the correct content is necessary,
growing evidence suggests that how an artificial agent communicates, i.e., its interaction style, can
significantly influence the user’s experience and willingness to continue interacting [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>
        Interaction style refers to the expressive modality through which a system delivers its output,
encompassing both verbal and non-verbal aspects. This includes linguistic tone, prosody, affective cues,
as well as physical parameters such as movement expressiveness or compliance in embodied agents
that convey physical interaction [
        <xref ref-type="bibr" rid="ref6">6</xref>
]. While different styles may convey the same task content, they can
have divergent emotional impacts. A communication style that fails to align with the user’s preferences
or emotional state may result in discomfort, reduced trust, or even disengagement [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Consider, for example, a virtual assistant delivering motivational or instructional feedback. The
same message may be conveyed in a calm, neutral manner or with greater emotional warmth and
enthusiasm. Although the semantic content is preserved, the user may respond differently depending
on the emotional framing. In long-term interactions, maintaining engagement and emotional resonance
is essential, particularly when the agent operates in support-oriented roles [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ].
      </p>
      <p>Therefore, to endow artificial agents with the capability of autonomously learning and implementing
different communication styles, this work introduces a modular architecture integrating a Reinforcement
Learning (RL) module with a Large Language Model (LLM)-based utterance generation pipeline,
enabling the agent to adjust its verbal behavior based on the user’s estimated affective state.</p>
      <p>In addition to the architectural contribution, this paper presents a preliminary evaluation aimed at
validating the generative capabilities of the dialogue module. Specifically, a set of stylistically constrained
prompts was used to generate responses across different interaction styles, and the resulting utterances
were analyzed using sentiment analysis metrics.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Framework</title>
      <sec id="sec-2-1">
        <title>2.1. Speech-To-Text</title>
        <p>
          The Speech-To-Text module is responsible for transcribing the user’s spoken input into written text
in real time [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. This transcription serves as the primary input for the linguistic understanding and
response generation processes. Given the importance of accurately interpreting user utterances in
emotionally sensitive and personalized interactions, the module must ensure both lexical accuracy and
robustness to variations in speech patterns, accents, and background noise.
        </p>
        <p>The transcribed utterance is passed to the Dialogue Generation module, where it is used to inform the
response generation. By enabling seamless capture of user input, the Speech-To-Text component plays a
critical role in grounding the interaction in natural, intuitive communication.</p>
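        <p>The framework does not commit to a specific transcription engine. As a purely illustrative sketch, a
turn-level capture loop could be built on the open-source SpeechRecognition package; the engine choice
and error handling below are our assumptions, not part of the proposed architecture.</p>
        <preformat># Illustrative speech-to-text capture; any ASR backend could be substituted.
from typing import Optional

import speech_recognition as sr

def transcribe_turn(recognizer: sr.Recognizer, mic: sr.Microphone) -> Optional[str]:
    """Capture one spoken utterance and return its transcription (None on failure)."""
    with mic as source:
        # Calibrate against background noise before listening.
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return None  # utterance was unintelligible</preformat>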
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Emotion Recognition</title>
        <p>The Emotion Recognition module estimates the user’s affective state in real time, enabling the system to
adapt its interaction strategy based on the perceived emotional response. Accurate emotion recognition
is essential to support personalized and empathetic interaction, particularly in behavior change scenarios
where emotional engagement is closely tied to adherence and motivation.</p>
        <p>
          Different sensing modalities can be employed to infer the user’s emotional state, each with distinct
advantages and limitations. First of all, facial expression recognition is one of the most common
techniques in affective computing, leveraging computer vision models to classify discrete emotions
or compute continuous affective dimensions such as valence and arousal [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. While it is effective in
controlled settings, this method is sensitive to occlusions, head pose variations, and lighting conditions
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Its reliability assumes that the user is positioned frontally and remains visually accessible to the
camera, which may not always hold in naturalistic environments.
        </p>
        <p>
          Physiological sensing offers an alternative that bypasses visual constraints by analyzing biosignals
such as heart rate variability, skin conductance, or respiration patterns [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. These signals provide
rich information about autonomic nervous system activity, allowing for continuous estimation [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
However, this approach typically requires the user to wear dedicated sensors (e.g., smartwatches, chest
straps), which may reduce practicality and user acceptance in long-term scenarios.
        </p>
        <p>
          Lastly, posture-based emotion recognition represents a more recent direction, exploiting skeletal
tracking or full-body pose estimation from RGB or depth cameras [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. These methods enable contactless
affect sensing based on body configuration and movement dynamics, without requiring frontal face
visibility. Posture-based emotion recognition is particularly suited to scenarios in which the user may
not be facing the camera but remains physically expressive through gestures or body orientation.
        </p>
        <p>Each of these modalities can be used independently or in combination to enhance the robustness of
emotion recognition. In this framework, the emotional signal, regardless of how it is acquired, is mapped
to a scalar reward that reflects the affective quality of the interaction and informs the reinforcement
learning process.</p>
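        <p>As a minimal illustration of this mapping, a discrete emotion label, however it is obtained, can be
converted into a scalar valence score. The numeric values below are hypothetical placeholders; in
practice, they should be replaced with literature-derived scores such as those discussed above.</p>
        <preformat># Hypothetical mapping from detected emotion labels to valence scores in [-1, 1].
# Actual values should be derived from the affective-computing literature.
VALENCE = {
    "happiness": 0.9,
    "surprise": 0.4,
    "neutral": 0.0,
    "sadness": -0.6,
    "fear": -0.7,
    "anger": -0.8,
    "disgust": -0.8,
}

def valence(emotion_label: str) -> float:
    """Return the valence V(e) of a detected emotion, defaulting to neutral."""
    return VALENCE.get(emotion_label, 0.0)</preformat>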
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Empathic Style Learning</title>
        <p>The Empathic Style Learning module governs the agent’s ability to personalize its communicative
behavior in real-time. Its goal is to select, at each conversational step, the most appropriate interaction
style to promote a positive and engaging user experience. By incorporating implicit emotional cues
from the user, the system continually refines its strategy to sustain emotional resonance and foster
long-term involvement.</p>
        <p>This adaptive mechanism is built upon a reinforcement learning framework that operates at the
level of style modulation. The module receives as input the user’s detected emotional responses from
the current interaction window and computes a reward signal reflecting the affective impact of the
last interaction style used. Based on this feedback, the system updates its internal belief about the
effectiveness of each style and probabilistically selects the style to be used in the next interaction cycle.</p>
        <sec id="sec-2-3-1">
          <title>2.3.1. Reward Computation</title>
          <p>The first component of the Empathic Style Learning module is the Reward Computation. At each
iteration, the quality of the interaction must be scored in order to assign a reward to the interaction
style implemented at the previous turn.</p>
          <p>
            Given a multimodal monitoring of the user, the valence of the detected emotion, $V(e)$, should be
taken into account, reflecting the positive or negative emotional charge of the user. These scores can be
derived from prior literature and quantify how each emotion contributes to the perceived quality of the
interaction [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ].
          </p>
          <p>To capture the affective outcome of a full interaction turn, the system computes the mean valence
score across all detected expressions during that window. Formally, the reward at time $t$ is defined as:</p>
          <p>$r_t = \frac{1}{N} \sum_{i=1}^{N} V(e_i) \qquad (1)$</p>
          <p>where $N$ is the total number of time instants processed and $e_i$ denotes the emotion detected at
frame $i$. This average reward provides a scalar measure of the user’s overall affective state in response to
the most recent style employed by the agent. By focusing on continuous, real-time feedback rather
than one-time evaluations, the system is capable of tracking affective trends and adjusting its behavior
accordingly.</p>
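          <p>A minimal sketch of Eq. (1), assuming each detection in the interaction window has already been
mapped to a valence score as described in Section 2.2:</p>
          <preformat>from typing import Sequence

def compute_reward(valence_scores: Sequence[float]) -> float:
    """Mean valence over the N frames of the last window: r_t = (1/N) * sum_i V(e_i)."""
    if not valence_scores:
        return 0.0  # no emotion detections in this window
    return sum(valence_scores) / len(valence_scores)

# Example: three frames detected during one interaction turn.
r_t = compute_reward([0.9, 0.0, 0.4])  # 0.433...</preformat>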
        </sec>
        <sec id="sec-2-3-2">
          <title>2.3.2. Style Selection</title>
          <p>To dynamically adapt its style of interaction, the system employs Thompson Sampling, a Bayesian
reinforcement learning algorithm designed for efficient exploration and exploitation in uncertain
environments [<xref ref-type="bibr" rid="ref17">17</xref>]. Each interaction style is treated as an independent arm in a bandit formulation,
where the agent maintains a Beta distribution over the probability that each style yields a positive
change in user affect.</p>
          <p>At each step, the algorithm samples from these distributions and selects the style with the highest
sampled value. After executing the selected style, the resulting emotional response is quantified via the
reward signal, and the success or failure of the action is evaluated by computing the reward difference:</p>
          <p>$\Delta r_t = r_t - r_{t-1} \qquad (2)$</p>
          <p>If this difference is positive or zero, the action is interpreted as beneficial, and the success count for
that style is incremented. Otherwise, the failure count is increased. This formulation ensures that the
system rewards not just positive emotional valences but also improvements relative to prior interaction
states, encouraging strategies that maintain or enhance affective engagement over time.</p>
          <p>The probabilistic nature of Thompson Sampling enables the agent to remain responsive to changing
user preferences, avoid premature convergence, and maintain sufficient exploration to adapt to evolving
interaction dynamics, features especially desirable in long-term interaction settings [<xref ref-type="bibr" rid="ref18">18</xref>].</p>
          <p>A graphical representation of the functioning of the Thompson Sampling algorithm implemented in
the Style Selection module is reported in Figure 2.</p>
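          <p>A compact sketch of the selection and update loop described above; the uniform Beta(1, 1) priors
are an assumption, as the initialization is not specified here.</p>
          <preformat>import random

class StyleBandit:
    """Thompson Sampling over styles with Beta(successes + 1, failures + 1) arms."""

    def __init__(self, styles):
        self.styles = list(styles)
        self.successes = {s: 0 for s in self.styles}
        self.failures = {s: 0 for s in self.styles}

    def select_style(self) -> str:
        # Sample one value per arm from its Beta posterior; play the argmax.
        samples = {
            s: random.betavariate(self.successes[s] + 1, self.failures[s] + 1)
            for s in self.styles
        }
        return max(samples, key=samples.get)

    def update(self, style: str, reward: float, prev_reward: float) -> None:
        # Eq. (2): a non-negative reward difference counts as a success.
        if reward - prev_reward >= 0:
            self.successes[style] += 1
        else:
            self.failures[style] += 1

bandit = StyleBandit(["Neutral", "Enthusiastic"])</preformat>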
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Dialogue Generation</title>
        <p>The Dialogue Generation module governs the verbal interaction between the user and the system,
translating incoming user input into semantically coherent and stylistically appropriate system
responses. Unlike systems that rely on pre-scripted interaction flows, our approach is designed to respond
dynamically to user utterances, enabling open-ended yet style-aware dialogue generation.</p>
        <p>At each interaction turn, the Dialogue Generation module receives two inputs: (i) the latest user
utterance, transcribed via the Speech-to-Text module, and (ii) the current interaction style selected by
the Empathic Style Learning module. These inputs are used to compose a structured prompt that guides
a large language model in producing a contextually appropriate and stylistically aligned response.</p>
        <p>In this framework, we operationalize two communicative styles as a representative case study:
• Neutral, characterized by direct, factual, and emotionally neutral phrasing, suited for users who
prefer efficiency and minimal affective stimulation;
• Enthusiastic, marked by positively expressive, motivational language, aimed at encouraging
engagement and creating a socially supportive experience.</p>
        <p>
          These styles were selected to instantiate a contrast along the affective expressiveness dimension,
which is frequently discussed in the literature on empathic and persuasive communication. Prior studies
suggest that user preferences regarding affective intensity may vary significantly across individuals
and contexts [
          <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
          ]. While some users may feel more comfortable with emotionally neutral and
to-the-point communication, others respond more positively to expressive and socially engaging behavior.
The proposed framework, however, is not tied to any specific pair of styles. It is generalizable to
any set of well-defined communicative behaviors that differ in tone, formality, emotional warmth, or
other stylistic dimensions. The selection of styles can be informed by theoretical models (e.g., social
presence, communication accommodation theory) or derived empirically through design and user
testing, depending on the target application.
        </p>
        <p>The Dialogue Manager generates system utterances using GPT-4 via the OpenAI ChatGPT API. For
each user input, a prompt is composed that instructs the model to reformulate the response following
the selected style. The prompt template is:
“Respond to the following user utterance in a [STYLE] manner, as defined below.</p>
        <p>Style definition: [STYLE DEFINITION]</p>
        <p>User utterance: ’[USER INPUT]’”</p>
        <p>Here, [STYLE] is replaced by the current style, while [STYLE DEFINITION] provides a textual
description to condition the model appropriately. The [USER INPUT] field contains the transcribed user
utterance. This design allows the system to flexibly generate consistent, stylistically adapted responses
to a wide range of inputs while maintaining semantic coherence and task relevance.</p>
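        <p>A sketch of how the prompt composition and model call could be implemented with the OpenAI
Python client; the client usage, model identifier, and one-line style definitions are illustrative
assumptions rather than the exact implementation.</p>
        <preformat>from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative one-line definitions paraphrasing the two styles above.
STYLE_DEFINITIONS = {
    "Neutral": "Direct, factual, and emotionally neutral phrasing.",
    "Enthusiastic": "Positively expressive, motivational language that encourages engagement.",
}

def generate_response(user_input: str, style: str) -> str:
    """Fill the prompt template described above and query the LLM."""
    prompt = (
        f"Respond to the following user utterance in a {style} manner, as defined below.\n"
        f"Style definition: {STYLE_DEFINITIONS[style]}\n"
        f"User utterance: '{user_input}'"
    )
    completion = client.chat.completions.create(
        model="gpt-4",  # model identifier is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content</preformat>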
        <p>By decoupling content planning from style selection and leveraging a generative language model with
style conditioning, the Dialogue Generation module supports naturalistic and adaptive conversations,
reinforcing the capability of the system to sustain engagement throughout the interaction.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Preliminary Evaluation</title>
      <p>To explore whether large language models (LLMs) are capable of consistently producing utterances
that reflect distinct communicative styles, we conducted a preliminary evaluation based on sentiment
analysis. Specifically, we aimed to assess whether stylistic prompts can elicit systematic variations in
the affective content of generated responses.</p>
      <p>We selected a set of 10 representative user utterances that may occur during an interaction with a
socially assistive agent. For each utterance, we generated two responses using the prompting strategy
described in this paper, instructing the LLM (GPT-4) to rephrase the system’s reply in two stylistic variants.
These styles were chosen to reflect qualitatively different approaches to empathy and encouragement
in assistive dialogue.</p>
      <p>To assess the affective and expressive characteristics of the generated responses, we applied sentiment
analysis using the VADER (Valence Aware Dictionary and sEntiment Reasoner) module from NLTK
[<xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>]. These metrics provide complementary perspectives on the emotional and stylistic properties of
language. In particular, the following VADER items were computed:
• Polarity: a normalized polarity value between −1 and +1, summarizing the overall sentiment of
the sentence based on lexical features and intensifiers;
• Positive, Neutral, and Negative: the proportion of text perceived as expressing positive, neutral,
or negative sentiment, respectively, with values ranging from 0 to 1 and summing to 1.</p>
      <p>In addition to these sentiment metrics, we also calculated the number of words for each response to
evaluate differences in verbosity between styles.</p>
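      <p>A sketch of the metric extraction, assuming NLTK’s bundled VADER implementation
[<xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>]; the word count is used as a simple verbosity proxy.</p>
      <preformat>import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

def score_response(text: str) -> dict:
    """Return the VADER metrics used in the evaluation, plus word count."""
    scores = analyzer.polarity_scores(text)
    return {
        "polarity": scores["compound"],  # normalized polarity in [-1, +1]
        "positive": scores["pos"],
        "neutral": scores["neu"],
        "negative": scores["neg"],
        "words": len(text.split()),      # verbosity proxy
    }</preformat>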
      <p>By analyzing these metrics across the two stylistic conditions (Neutral and Enthusiastic), we aim to
determine whether the stylistic constraints embedded in the prompt lead to consistent and measurable
differences in the generated responses. This analysis provides a preliminary assessment of the ability of
the Dialogue Generation module to implement interaction styles in a controlled and interpretable manner.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>While the positive component did not differ significantly between styles, Enthusiastic responses were
rated less Neutral (p &lt; 0.01), reflecting their more expressive nature. Moreover, the negative sentiment
remained close to zero for both styles. Additionally, Enthusiastic responses were significantly longer
in terms of word count (p &lt; 0.0001), suggesting that the Enthusiastic phrasing tends to produce more
verbose text.</p>
      <p>These findings suggest that stylistic instructions embedded in the prompt successfully induced
measurable and coherent variations in sentiment and expressiveness, supporting the use of prompting
as a viable mechanism for modulating interaction style in adaptive agents.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This work introduced a modular framework for the real-time modulation of interaction style in
assistive human-agent communication. The proposed system integrates multimodal emotion recognition,
reinforcement learning, and prompting strategies to enable adaptive, affect-sensitive behavior in large
language model (LLM)-based dialogue agents.</p>
      <p>Interaction styles are selected through a Thompson Sampling algorithm, which optimizes the selection
policy based on continuous user affect monitoring. Each style is operationalized via prompt conditioning
of an LLM, ensuring that the generated responses are both contextually appropriate and stylistically
consistent.</p>
      <p>Preliminary evaluation focused primarily on the generative capabilities of the dialogue module,
assessing the extent to which prompt-based conditioning can modulate style in LLM-driven responses.
While the results confirmed significant and coherent stylistic variations, the study did not include
real-time trials with end-users in assistive scenarios. As such, the effectiveness of the full adaptive
framework, including the closed-loop integration of emotion recognition, style selection, and dialogue
generation, remains to be validated in long-term, ecologically valid interactions.</p>
      <p>Future work will therefore address these limitations by: (i) expanding the repertoire of communicative
styles and affective adaptation strategies; (ii) implementing a multimodal emotion recognition pipeline;
and (iii) conducting controlled and longitudinal user studies in real-world assistive contexts to evaluate
the impact of adaptive style modulation on user engagement, trust, and task performance. These
steps will enable a more comprehensive validation of the proposed framework and its potential for
deployment in practical assistive human–agent interaction scenarios.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the Italian Ministry of Research, under the complementary actions
to the NRRP “Fit4MedRob - Fit for Medical Robotics” Grant PNC0000007, (CUP: B53C22006990001) and
partially by Next Generation EU – “Age-It – Ageing Well in an Ageing Society” project (PE0000015),
National Recovery and Resilience Plan (NRRP).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used generative AI tools (specifically, OpenAI’s GPT-4)
to assist with grammar and spelling checks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Fogg</surname>
          </string-name>
          ,
          <article-title>A behavior model for persuasive design</article-title>
          ,
          <source>in: Proceedings of the 4th international Conference on Persuasive Technology</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. F.</given-names>
            <surname>Tondello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Nacke</surname>
          </string-name>
          ,
          <article-title>Personalizing persuasive strategies in gameful systems to gamification user types</article-title>
          ,
          <source>in: Proceedings of the 2018 CHI conference on human factors in computing systems</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Masthof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vassileva</surname>
          </string-name>
          ,
          <article-title>Personalized persuasion for behaviour change</article-title>
          ,
          <source>Personalized Human-Computer Interaction</source>
          , Walter de Gruyter GmbH Co KG (Ed.) (
          <year>2023</year>
          )
          <fpage>205</fpage>
          -
          <lpage>235</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T. W.</given-names>
            <surname>Bickmore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Picard</surname>
          </string-name>
          ,
          <article-title>Establishing and maintaining long-term human-computer relationships</article-title>
          ,
          <source>ACM Transactions on Computer-Human Interaction (TOCHI)</source>
          <volume>12</volume>
          (
          <year>2005</year>
          )
          <fpage>293</fpage>
          -
          <lpage>327</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Ö. N.</given-names>
            <surname>Yalçın</surname>
          </string-name>
          ,
          <article-title>Empathy framework for embodied conversational agents</article-title>
          ,
          <source>Cognitive Systems Research</source>
          <volume>59</volume>
          (
          <year>2020</year>
          )
          <fpage>123</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tamantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Langlois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>De Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H. A.</given-names>
            <surname>Mohamadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Beckwée</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Swinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Verstraten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vanderborght</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zollo</surname>
          </string-name>
          ,
          <article-title>Promoting active participation in robot-aided rehabilitation via machine learning and impedance control</article-title>
          ,
          <source>Frontiers in Digital Health</source>
          <volume>7</volume>
          (
          <year>2025</year>
          )
          <fpage>1559796</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Purington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Taft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sannon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Bazarova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <article-title>“Alexa is my new BFF”: social roles, user satisfaction, and personification of the Amazon Echo</article-title>
          ,
          <source>in: Proceedings of the 2017 CHI conference extended abstracts on human factors in computing systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2853</fpage>
          -
          <lpage>2859</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Calvo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D</given-names>
            <surname>'Mello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Gratch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kappas</surname>
          </string-name>
          ,
          <source>The Oxford handbook of affective computing</source>
          , Oxford University Press,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Beraldo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tamantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Umbrico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Orlandini</surname>
          </string-name>
          ,
          <article-title>Fostering behavior change through cognitive social robotics</article-title>
          ,
          <source>in: International Conference on Social Robotics</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>279</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Trivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sonik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <article-title>Speech to text and text to speech recognition systems - a review</article-title>
          ,
          <source>IOSR J. Comput. Eng</source>
          <volume>20</volume>
          (
          <year>2018</year>
          )
          <fpage>36</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <article-title>A brief review of facial emotion recognition based on visual information</article-title>
          ,
          <source>Sensors</source>
          <volume>18</volume>
          (
          <year>2018</year>
          )
          <fpage>401</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F. Z.</given-names>
            <surname>Canal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Matias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. G.</given-names>
            <surname>Scotton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>de Sa Junior</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pozzebon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Sobieranski</surname>
          </string-name>
          ,
          <article-title>A survey on facial emotion recognition techniques: A state-of-the-art literature review</article-title>
          ,
          <source>Information Sciences</source>
          <volume>582</volume>
          (
          <year>2022</year>
          )
          <fpage>593</fpage>
          -
          <lpage>617</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cittadini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tamantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scotto di Luzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lauretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zollo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cordella</surname>
          </string-name>
          ,
          <article-title>Affective state estimation based on Russell's model and physiological measurements</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>9786</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tamantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Cristofanelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fracasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Umbrico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cortellessa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Orlandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cordella</surname>
          </string-name>
          ,
          <article-title>Physiological sensor technologies in workload estimation: A review</article-title>
          ,
          <source>IEEE Sensors Journal</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P. V.</given-names>
            <surname>Paiva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Ramos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gavrilova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          ,
          <article-title>SkelETT: skeleton-to-emotion transfer transformer</article-title>
          , IEEE Access (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E. V.</given-names>
            <surname>Sampaio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lévêque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>da Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Le Callet</surname>
          </string-name>
          ,
          <article-title>Are facial expression recognition algorithms reliable in the context of interactive media? A new metric to analyse their performance</article-title>
          ,
          <source>in: EmotionIMX: Considering Emotions in Multimedia Experience (ACM IMX 2022 Workshop)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Feel-good Thompson sampling for contextual bandits and reinforcement learning</article-title>
          ,
          <source>SIAM Journal on Mathematics of Data Science</source>
          <volume>4</volume>
          (
          <year>2022</year>
          )
          <fpage>834</fpage>
          -
          <lpage>857</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Molle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tamantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lauretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Romano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zollo</surname>
          </string-name>
          ,
          <article-title>An online reinforcement learning method to improve control adaptability in robot-aided rehabilitation</article-title>
          ,
          <source>Engineering Applications of Artificial Intelligence</source>
          <volume>161</volume>
          (
          <year>2025</year>
          )
          <fpage>112248</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hutto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gilbert</surname>
          </string-name>
          ,
          <article-title>VADER: A parsimonious rule-based model for sentiment analysis of social media text</article-title>
          ,
          <source>in: Proceedings of the International AAAI Conference on Web and Social Media</source>
          , volume 8,
          <year>2014</year>
          , pp.
          <fpage>216</fpage>
          -
          <lpage>225</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Borg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Boldt</surname>
          </string-name>
          ,
          <article-title>Using VADER sentiment and SVM for predicting customer response sentiment</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>162</volume>
          (
          <year>2020</year>
          )
          <fpage>113746</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>