           Applying Psychology of Persuasion to Conversational Agents
            through Reinforcement Learning: an Exploratory Study
    Francesca Di Massimo1 , Valentina Carfora2 , Patrizia Catellani2 and Marco Piastra1
1 Computer Vision and Multimedia Lab, Università degli Studi di Pavia, Italy
2 Dipartimento di Psicologia, Università Cattolica di Milano, Italy
                              francesca.dimassimo01@universitadipavia.it
                                        valentina.carfora@unicatt.it
                                        patrizia.catellani@unicatt.it
                                             marco.piastra@unipv.it

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This study is set in the framework of task-oriented conversational agents in which dialogue management is obtained via Reinforcement Learning. The aim is to explore the possibility of overcoming the typical end-to-end training approach through the integration of a quantitative model developed in the field of persuasion psychology. Such integration is expected to accelerate the training phase and improve the quality of the dialogue obtained. In this way, the resulting agent would take advantage of some subtle psychological aspects of the interaction that would be difficult to elicit via end-to-end training. We propose a theoretical architecture in which the psychological model above is translated into a probabilistic predictor and then integrated in the reinforcement learning process, intended in its partially observable variant. The experimental validation of the architecture proposed is currently ongoing.
1   Introduction

A typical conversational agent has a multi-stage architecture: spoken language, written language and dialogue management, see Allen et al. (2001). This study focuses on dialogue management for task-oriented conversational agents. In particular, we focus on the creation of a dialogue manager aimed at inducing healthier nutritional habits in the interactant.
   Given that the task considered involves psychosocial aspects that are difficult to program directly, the idea of achieving an effective dialogue manager via machine learning techniques, reinforcement learning (RL) in particular, may seem attractive. At present, many RL-based approaches involve training an agent end-to-end from a dataset of recorded dialogues, see for instance Liu (2018). However, the chance of obtaining significant results in this way entails substantial efforts in both collecting sample data and performing experiments. Worse yet, such efforts would have to rely on the even stronger hypothesis that the RL agent is able to elicit psychosocial aspects on its own. As an alternative, in this study we envisage the possibility of enhancing the RL process by harnessing a model developed and accepted in the field of social psychology, so as to provide a more reliable learning ground and a substantial accelerator for the process itself.
   Our study relies on a quantitative, causal model of human behavior currently studied in the field of social psychology (see Carfora et al., 2019), aimed at assessing the effectiveness of message framing in inducing healthier nutritional habits. The goal of the model is to assess whether messages with different frames can be differentially persuasive according to the users' psychosocial characteristics.
Figure 1: SEM simplified model for the case at hand.

Figure 2: DBN translation of the SEM shown in Figure 1.
to "extremely confident" (7). Attitude is assessed     tude and Intention Change in relation to healthy
through 8 items associated to a differential scale     eating.
ranging from 1 to 7 (the higher the score, the more       The overall model is described by the Struc-
positive the attitude). Intention Change is mea-       tural Equation Model (SEM, see Wright, 1921)
sured with three items on a Likert scale, ranging      in Figure 1. For simplicity, only three items are
from 1 (“definitely do not”) to 7 (“definitely do”).   shown for each latent variable. Besides allow-
See Carfora et el. (2019).                             ing the description of latent variables, SEMs are
   In our study, the psychosocial model was as-        causal models in the sense that they allow a sta-
sessed experimentally on a group of volunteers.        tistical analysis of the strength of causal relations
Each participant was first proposed a question-        among the latents themselves, as represented by
naire (Time 1 – T1) for measuring Self-Efficacy,       the arrows in figure. SEMs are linear models, and
Attitude and Intention Change. In a subsequent         thus all causal relations underpin linear equations.
phase (i.e. message intervention), participants           Note that latent variables in a SEM have dif-
were randomly assigned to one of four groups,          ferent roles: in this case gain/non-gain/loss/non-
each receiving a different type of persuasive mes-     loss messages are independent variables, Intention
sage: gain (i.e. positive behavior leads to posi-      Change is a dependent variable, Attitude is a me-
tive outcomes), non-gain (negative behavior pre-       diator of the relationship between the independent
vents positive outcomes), loss (negative behavior      and the dependent variables, and Self-Efficacy is a
leads to negative outcomes) and non-loss (posi-        moderator, namely, it explains the intensity ot the
tive behavior prevents negative outcomes) (Hig-        relation it points at. Intention Change was mea-
gins, 1997; Cesario et al., 2013). In a last phase     sures at both T1 and T2, Attitude was measured at
(Time 2 - T2), the effectiveness of the message in-    both T1 and T2, and Self-Efficacy was measured at
tervention was then evaluated with a second ques-      T1 only. Note that the time transversality (i.e. T1
tionnaire, to detect changes in participants’ Atti-    → T2) is implicit in the SEM depiction above.
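   To fix ideas, the linear structure of the SEM can be sketched as a pair of structural equations with moderated effects. This is an illustrative sketch only: the coefficients are hypothetical placeholders rather than the estimates reported in Carfora et al. (2019), and the measurement equations linking each latent to its items are omitted:

\mathrm{Attitude}_{T2} = \beta_0 + \beta_1\,\mathrm{Attitude}_{T1} + \beta_2\,\mathrm{Message} + \beta_3\,(\mathrm{Message} \times \mathrm{SelfEfficacy}) + \varepsilon_A,
\mathrm{IntentionChange} = \gamma_0 + \gamma_1\,\mathrm{Attitude}_{T2} + \gamma_2\,(\mathrm{Attitude}_{T2} \times \mathrm{SelfEfficacy}) + \varepsilon_I,

where Message stands for the (dummy-coded) frame of the message received and the interaction terms express the moderating role of Self-Efficacy.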
3   Probabilistic model: Bayesian Network

Once the SEM is defined, we aim to translate it into a probabilistic model, so as to obtain the probability distributions needed for the learning process. We resort to a graphical model, and in particular to a Bayesian Network (BN, see Ben Gal, 2007), namely a graph-based description of both the observable and latent random variables in the model and of their conditional dependencies. In BNs, nodes represent the variables and edges represent dependencies between them, whereas the lack of edges implies their independence, hence a simplification of the model. As a general rule, the joint probability of a BN factorizes as follows:

P(X_1, \dots, X_N) = \prod_{i=1}^{N} P(X_i \mid \mathrm{parents}(X_i)),

where X_1, \dots, X_N are the random variables in the model and parents(X_i) indicates all the nodes having an edge oriented towards X_i.
   In the case at hand, a temporal description of the model, accounting for the time steps T1 and T2, is necessary as well. For this purpose, we use a Dynamic Bayesian Network (DBN, see Dagum et al., 1992). The DBN thus obtained is shown in Figure 2.
   Notice that the messages are only significant at T2, as they have not been sent yet at T1. We gathered the messages into the single node Message Type, assuming it can take four mutually exclusive values. The mediator Attitude is measured at both time steps, while the moderator Self-Efficacy is constant over time, as suggested in Section 2. Intention Change is relevant at T2 only since, as we will mention in Section 5, it will be used to estimate a reward function once the final time step is reached.
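   Anticipating the list of conditional distributions given in Section 4, the general factorization above specializes, for the DBN of Figure 2, to the following sketch (written directly from that list; node names follow the figure, whose exact edge set is not reproduced here):

P(\mathbf{X}) = P(\mathrm{MessageType}) \prod_{i=1}^{8} P(\mathrm{Item\,A}i)\,P(\mathrm{Item\,SE}i)
  \times P(\mathrm{Attitude\,T1} \mid \mathrm{Item\,A}1, \dots, \mathrm{Item\,A}8)
  \times P(\mathrm{SelfEfficacy} \mid \mathrm{Item\,SE}1, \dots, \mathrm{Item\,SE}8)
  \times P(\mathrm{Attitude\,T2} \mid \mathrm{Item\,A}1, \dots, \mathrm{Item\,A}8, \mathrm{MessageType}, \mathrm{SelfEfficacy})
  \times P(\mathrm{IntentionChange} \mid \mathrm{Attitude\,T2}, \mathrm{SelfEfficacy}),

where \mathbf{X} denotes the collection of all the variables in the DBN.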
4   Learning the BN

The collected data are as follows. The analysis was conducted on 442 interactants, divided into four groups, each one receiving a different type of message¹. The answers to the items of the questionnaire always had a range of 7 values. However, this induces a combinatorial explosion, making it impossible to cover all the subspaces (7^8 = 5,764,801 different combinations for Attitude, for instance). We thus decided to aggregate: low := (1 to 2); medium := (3 to 5); high := (6 to 7).
   Our aim is to learn the Joint Probability Distribution (JPD) of our model, as that would make us able to answer, through marginalizations and conditional probabilities, any query about the model itself. The conditional probability distributions to be learnt in the case in point are then the following:

• P(Item Ai), for i = 1, ..., 8;
• P(Item SEi), for i = 1, ..., 8;
• P(Message Type);
• P(Attitude T1 | Item Ai, i = 1, ..., 8);
• P(Self-Efficacy | Item SEi, i = 1, ..., 8);
• P(Attitude T2 | Item Ai, i = 1, ..., 8, Message Type, Self-Efficacy);
• P(Intention Change | Attitude T2, Self-Efficacy).

The first three can be easily inferred from the raw data as relative frequencies. As for the remaining four, even aggregating the 7 values as mentioned, a huge amount of data would still be necessary (3^8 · 2^4 · 3 = 314,928 subspaces for Attitude T2, for instance). As conducting a psychological study on that number of people would not be feasible, we address the issue with an appropriate choice of the learning method. To allow using Maximum Likelihood Estimation (MLE) to learn the BN, we resort to the Noisy-OR approximation (see Oniśko et al., 2001). According to this, through a few appropriate changes (not shown) to the graphical model, the number of subspaces can be greatly reduced (e.g. 3 · 2 · 3 = 18 for Attitude T2).

¹ The original study also included a control group, which we do not consider here.
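   To make the parameter reduction concrete, the snippet below sketches the standard binary Noisy-OR combination, in which a child with n binary parents is described by n link probabilities plus one leak parameter instead of a full table with 2^n entries; the multi-valued variables of our model would require the analogous Noisy-MAX construction, and all names in the snippet are purely illustrative (a sketch, not the implementation used in the study).

from typing import Sequence

def noisy_or(parents_active: Sequence[bool],
             link_probs: Sequence[float],
             leak: float = 0.0) -> float:
    # P(child = 1 | parents): each active parent i independently "causes"
    # the child with probability link_probs[i]; leak is the probability
    # that the child is on even when no parent is active.
    p_child_off = 1.0 - leak
    for active, p in zip(parents_active, link_probs):
        if active:
            p_child_off *= 1.0 - p   # parent i fails to trigger the child
    return 1.0 - p_child_off

# Eight item-parents are summarized by 8 + 1 parameters
# instead of 2**8 = 256 table entries.
links = [0.3] * 8
print(noisy_or([True, False, True, True, False, False, True, False], links, leak=0.05))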
5   Reinforcement Learning: Markov Decision Problems

The translation into a tool to be used for reinforcement learning is obtained in terms of Markov Decision Processes (MDPs), see Fabiani et al. (2010).
   Roughly speaking, in an MDP there is a finite number of situations or states of the environment, at each of which the agent is supposed to select an action to take, thus inducing a state transition and obtaining a reward. The objective is to find a policy determining the sequence of actions that generates the maximum possible cumulative reward over time. However, due to the presence of latents, in our case the agent cannot have complete knowledge of the state of the environment. In such a situation, the agent must build its own estimate of the current state based on the memory of past actions and observations. This entails using a variant of MDPs, namely Partially Observable Markov Decision Processes (POMDPs, see Kaelbling et al., 1998). We then define the following, with reference to the variables mentioned in Figure 2:

S := {states} = {Attitude T2, Self-Efficacy};

A := {actions} = {ask A1, ..., ask A8} ∪ {ask SE1, ..., ask SE8} ∪ {G, NG, L, NL},

where Ai denotes the question for Item Ai, SEi denotes the question for Item SEi, and G, NG, L, NL denote the actions of sending Gain, Non-gain, Loss and Non-loss messages respectively;

Ω := {observations} = {Item A1, ..., Item A8, Item SE1, ..., Item SE8}.
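   In code, the three sets above can be written down directly. The sketch below uses hypothetical identifiers, assumes the three aggregated levels of Section 4 for both latents, and reads an observation as the (aggregated) answer obtained for the item just asked; it only fixes the representation reused in the later snippets.

# Aggregated levels for the latent variables (see Section 4).
LEVELS = ("low", "medium", "high")

# States: joint values of Attitude T2 and Self-Efficacy.
STATES = [(att, se) for att in LEVELS for se in LEVELS]

# Actions: ask one of the 16 items, or send one of the 4 framed messages.
ITEMS = [f"A{i}" for i in range(1, 9)] + [f"SE{i}" for i in range(1, 9)]
ACTIONS = [f"ask_{item}" for item in ITEMS] + ["G", "NG", "L", "NL"]

# Observations: the answer (aggregated level) to the item just asked.
OBSERVATIONS = [(item, level) for item in ITEMS for level in LEVELS]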
   Starting from an unknown initial state s_0 (whose distribution is often taken to be uniform over S, as no information is available), the agent takes an action a_0 that brings it, at time step 1, to state s_1, unknown as well. There, an observation o_1 is made.
   The process is then repeated over time, until a goal state of some kind has been reached. Hence, we can define the history as an ordered succession of actions and observations:

h_t := \{a_0, o_1, \dots, a_{t-1}, o_t\}, \qquad h_0 = \emptyset.

   As at all steps there is uncertainty about the actual state, a crucial role is played by the agent's estimate of the state of the environment, i.e. by the belief state. The agent's belief at time step t+1, denoted as b_{t+1}, is driven by its previous belief b_t and by the newly acquired information, i.e. the action taken a_t and the observation made o_{t+1}. We then have:

b_{t+1}(s_{t+1}) = P(s_{t+1} \mid b_t, a_t, o_{t+1}).
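   A minimal sketch of how this update can be computed, once the functions T and O defined below are available, is the standard Bayes-filter expansion of the formula above (function and argument names are ours, chosen to match the notation):

def update_belief(belief, action, obs, T, O, states=STATES):
    # b_{t+1}(s') is proportional to O(o', a, s') * sum_s T(s', s, a) * b_t(s).
    new_belief = {}
    for s_next in states:
        predicted = sum(T(s_next, s, action) * belief[s] for s in states)
        new_belief[s_next] = O(obs, action, s_next) * predicted
    norm = sum(new_belief.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: p / norm for s, p in new_belief.items()}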
   In the POMDP framework, the agent's choices about how to behave are influenced by its belief state and by the history. Thus, we define the agent's policy,

\pi = \pi(b_t, h_t),

which we aim to optimize. To complete the picture, we define the following functions to describe the evolution of the model in time (the prime notation indicates a reference to the subsequent time step):

state-transition function:  T : (s, a) \mapsto P(s' \mid s, a) := T(s', s, a);
observation function:  O : (s, a) \mapsto P(o' \mid a, s') := O(o', a, s');
reward function:  R : (s, a) \mapsto \mathbb{E}[r' \mid s, a] := R(s, a).

These functions can be easily adapted to the specifics of the case at hand. It can be seen that, once the JPD derived from the DBN is completely specified, the reward is deterministic. In particular, it is computed by evaluating the changes in the values of the latent Intention Change.
   As we are interested in finding an optimal policy, we now need to evaluate the goodness of each state when following a given policy. As there is no certainty about the states, we define the value function as a weighted average over the possible belief states:

V_\pi(b_t, h_t) := \sum_{s_t} b_t(s_t)\, V_\pi(s_t, b_t, h_t),

where V_\pi(s_t, b_t, h_t) is the state value function. The latter depends on the expected reward (and on a discount factor \gamma \in [0, 1] stating the preference for fast solutions):

V_\pi(s_t, b_t, h_t) := R(s_t, \pi(b_t, h_t)) + \gamma \sum_{s_{t+1}} T(s_{t+1}, s_t, \pi(b_t, h_t)) \sum_{o_{t+1}} O(o_{t+1}, \pi(b_t, h_t), s_{t+1})\, V_\pi(s_{t+1}, b_{t+1}, h_{t+1}).

   Finally, we define the target of our search, namely the optimal value function and the related optimal policy, as:

V^*(b_t, h_t) := \max_\pi V_\pi(b_t, h_t),
\pi^*(b_t, h_t) := \mathrm{argmax}_\pi V_\pi(b_t, h_t).

It can be shown that the optimal value function in a POMDP is always piecewise linear and convex, as exemplified in Figure 3. In other words, the optimal policy (in bold in Figure 3) combines different policies depending on the belief state values.

Figure 3: Basic example of computation of V_\pi in a case where S = {s_1, s_2}. p_1, p_2, p_3 are three possible policies.

   The next step is to use the POMDP to detect the optimal policy, that is, the sequence of questions to ask the interactant in order to draw her/his profile, hence the message to send, which maximizes the effectiveness of the interaction. To this end, the contribution of the DBN is fundamental: from the associated JPD, in fact, we construct the probability distributions necessary to define the functions T, O, R that compose the value function.
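   The two value-function equations above translate almost literally into a recursive evaluation routine for a fixed policy. The sketch below (finite horizon, names reused from the previous snippets, update_belief as defined earlier) is only meant to show how T, O and R enter the computation; it is not the search procedure of Section 6.

def state_value(s, belief, history, policy, T, O, R,
                states=STATES, observations=OBSERVATIONS,
                gamma=0.95, horizon=3):
    # V_pi(s_t, b_t, h_t), truncated at a finite horizon.
    if horizon == 0:
        return 0.0
    a = policy(belief, history)
    value = R(s, a)
    for s_next in states:
        p_trans = T(s_next, s, a)
        if p_trans == 0.0:
            continue
        for o_next in observations:
            p_obs = O(o_next, a, s_next)
            if p_obs == 0.0:
                continue
            b_next = update_belief(belief, a, o_next, T, O, states)
            h_next = history + [(a, o_next)]
            value += gamma * p_trans * p_obs * state_value(
                s_next, b_next, h_next, policy, T, O, R,
                states, observations, gamma, horizon - 1)
    return value

def belief_value(belief, history, policy, T, O, R,
                 states=STATES, observations=OBSERVATIONS, **kwargs):
    # V_pi(b_t, h_t): belief-weighted average of the state values.
    return sum(belief[s] * state_value(s, belief, history, policy, T, O, R,
                                       states, observations, **kwargs)
               for s in states)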
                                                          chology, as the basis for the reinforcement learn-
6   Policy from Monte Carlo Tree Search                   ing of a dialogue manager that drives a conversa-
It is evident from Figure 4, describing the full ex-      tional agent whose task is inducing healthier nu-
pansion of the policy tree for the case in point,         tritional habits in the interactant. The fundamen-
that the computational effort and power required          tal component of the method proposed is a DBN,
for a brute-force exploration of all possible com-        which is derived from the SEM above and acts like
binations is unaffordable.                                a predictor for the belief state value in a POMDP.
   Among all the policies that can be considered,            The main expected advantage is that, by doing
we want to select the optimal ones, thus avoid-           so, the RL agent will not need a time-consuming
ing coinsidering policies that are always underper-       period of training, possibly requiring the involve-
forming. In other words, with reference to Fig-           ment of human interactants, but can be trained ‘in
ure 3, we want to find Vp1 , Vp2 , Vp3 among those        house’ – at least at the beginning – and be released
of all possible policies, and use them to identify        in production at a later stage, once a first effec-
the optimal policy V ∗ .                                  tive strategy has been achieved through the DBN.
   To accomplish this, we select the Monte Carlo          Such method still requires an experimental valida-
Tree Search (MCTS) approach, see Chaslot et al.           tion, which is the current objective of our working
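   For concreteness, the selection step of this variant (UCT) scores each action with the UCB1 formula of Auer et al. (2002), balancing the average value observed so far against an exploration bonus. The sketch below assumes a hypothetical tree-node structure with per-action visit counts and cumulated values; it illustrates the selection rule only, not the full simulation loop.

import math

def ucb1_select(node, c=1.4):
    # node.children maps each action to a child node carrying
    # child.visits (visit count) and child.value_sum (cumulated return).
    total_visits = sum(child.visits for child in node.children.values())
    best_action, best_score = None, -math.inf
    for action, child in node.children.items():
        if child.visits == 0:
            return action  # try every action at least once
        score = (child.value_sum / child.visits
                 + c * math.sqrt(math.log(total_visits) / child.visits))
        if score > best_score:
            best_action, best_score = action, score
    return best_action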
7   Conclusions and future work

In this work we explored the possibility of harnessing a complete and experimentally assessed SEM, developed in the field of persuasion psychology, as the basis for the reinforcement learning of a dialogue manager that drives a conversational agent whose task is inducing healthier nutritional habits in the interactant. The fundamental component of the proposed method is a DBN, which is derived from the SEM above and acts as a predictor for the belief state value in a POMDP.
   The main expected advantage is that, by doing so, the RL agent will not need a time-consuming period of training, possibly requiring the involvement of human interactants, but can be trained 'in house' – at least at the beginning – and be released in production at a later stage, once a first effective strategy has been achieved through the DBN. Such a method still requires an experimental validation, which is the current objective of our working group.

Acknowledgments

The authors are grateful to Cristiano Chesi of IUSS Pavia for his revision of an earlier version of the paper and his precious remarks. We also acknowledge the fundamental help given by Rebecca Rastelli during her collaboration on this research.
References

Allen, J., Ferguson, G., & Stent, A. 2001. An architecture for more realistic conversational systems. In Proceedings of the 6th International Conference on Intelligent User Interfaces (pp. 1-8). ACM.

Anderson, Ronald D. & Vastag, Gyula. 2004. Causal modeling alternatives in operations research: Overview and application. European Journal of Operational Research, 156, 92-109.

Auer, Peter & Cesa-Bianchi, Nicolò & Fischer, Paul. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47, 235-256.

Bandura, A. 1982. Self-efficacy mechanism in human agency. American Psychologist, 37, 122-147.

Baron, Robert A. & Byrne, Donn Erwin & Suls, Jerry M. 1989. Exploring Social Psychology, 3rd ed. Boston, Mass.: Allyn and Bacon.

Ben Gal, I. 2007. Bayesian Networks. Encyclopedia of Statistics in Quality and Reliability. John Wiley & Sons.

Bertolotti, M., Carfora, V., & Catellani, P. 2019. Different frames to reduce red meat intake: The moderating role of self-efficacy. Health Communication, in press.

Carfora, V., Bertolotti, M., & Catellani, P. 2019. Informational and emotional daily messages to reduce red and processed meat consumption. Appetite, 141, 104331.

Cesario, J., Corker, K. S., & Jelinek, S. 2013. A self-regulatory framework for message framing. Journal of Experimental Social Psychology, 49, 238-249.

Chaslot, Guillaume & Bakkes, Sander & Szita, Istvan & Spronck, Pieter. 2008. Monte-Carlo Tree Search: A New Framework for Game AI. Bijdragen.

Dagum, Paul & Galper, Adam & Horvitz, Eric. 1992. Dynamic Network Models for Forecasting. Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence.

Dagum, Paul & Galper, Adam & Horvitz, Eric & Seiver, Adam. 1999. Uncertain reasoning and forecasting. International Journal of Forecasting.

De Waal, Alta & Yoo, Keunyoung. 2018. Latent Variable Bayesian Networks Constructed Using Structural Equation Modelling. 2018 21st International Conference on Information Fusion, 688-695.

Fabiani, Patrick & Teichteil-Königsbuch, Florent. 2010. Markov Decision Processes in Artificial Intelligence. Wiley-ISTE.

Gupta, Sumeet & Kim, Hee W. 2008. Linking structural equation modeling to Bayesian networks: Decision support for customer retention in virtual communities. European Journal of Operational Research, 190, 818-833.

Heckerman, David. 1995. A Bayesian Approach to Learning Causal Networks.

Higgins, E.T. 1997. Beyond pleasure and pain. American Psychologist, 52, 1280-1300.

Howard, Ronald A. 1972. Dynamic Programming and Markov Processes. The Mathematical Gazette, 46.

Kaelbling, Leslie Pack & Littman, Michael & Cassandra, Anthony R. 1998. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 101, 99-134.

Kocsis, Levente & Szepesvári, Csaba. 2006. Bandit Based Monte-Carlo Planning. Machine Learning: ECML 2006. Springer Berlin Heidelberg, 282-293.

Lai, T.L. & Robbins, Herbert. 1985. Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 6, 4-22.

Liu, Bing. 2018. Learning Task-Oriented Dialog with Neural Network Methods. PhD thesis.

Murphy, Kevin. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.

Oniśko, Agnieszka & Druzdzel, Marek J. & Wasyluk, Hanna. 2001. Learning Bayesian network parameters from small data sets: application of Noisy-OR gates. International Journal of Approximate Reasoning, 27.

Pearl, Judea. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Representation and Reasoning Series (2nd printing ed.). San Francisco, California: Morgan Kaufmann.

Silver, David & Veness, Joel. 2010. Monte-Carlo Planning in Large POMDPs. Advances in Neural Information Processing Systems, 23, 2164-2172.

Spaan, Matthijs T. J. 2012. Partially Observable Markov Decision Processes. In: Reinforcement Learning: State of the Art. Springer Verlag, 387-414.

Sutton, Richard & Barto, Andrew G. 1998. Reinforcement Learning: An Introduction. IEEE Transactions on Neural Networks, 9, 1054.

Wright, Sewall. 1921. Correlation and causation. Journal of Agricultural Research, 20, 557-585.

Young, Steve & Gasic, Milica & Thomson, Blaise & Williams, Jason. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101, 1160-1179.