<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of Multi-party Virtual Agents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Reshmashree B. Kantharaju</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Catherine Pelachaud</string-name>
          <email>catherine.pelachaud@upmc.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ISIR, Sorbonne Université</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CNRS - ISIR, Sorbonne Université</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Council of Coaches project has developed a platform that aims at providing tailored and personalized virtual coaching for ageing people to support them in improving their health and well-being. This paper presents the results of the user evaluations of the technical prototype that we conducted.</p>
      </abstract>
      <kwd-group>
        <title>Additional Key Words and Phrases</title>
        <kwd>Group Cohesion</kwd>
        <kwd>Virtual Agent Platform</kwd>
        <kwd>Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2.1 System</title>
      <p>
        The initial prototype used for this evaluation implements a small dialogue manager that steers the scripted
dialogue and controls the possible user input and the user interface [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The dialogue manager is responsible for selecting the
next move in the dialogue, controlling the user interface and listening to the feedback provided by the ASAP realizer. The
Greta and ASAP platforms are used for multimodal behaviour generation and for visualising Embodied Conversational
Agents (ECAs) in the Unity3D engine. A computer with the platform pre-installed was used for the experiment.
      </p>
      <p>Joint workshop on Games-Human Interaction (GHItaly21) and Multi-party Interaction in eXtended Reality (MIXR21), July 12, 2021, Bolzano, Italy.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
https://www.agents-united.org</p>
      <p>(1) Francois (Diet Coach): proposed a healthy recipe based on the user’s dietary preferences, which were collected
during the interaction.
(2) Olivia (Physical Activity Coach): recommended that the user go for a walk around the block once a day around meal
time.
(3) Emma (Social Coach): suggested that the user have a friend or a family member accompany them during the walk.
(4) Carlos (Peer): provided supportive dialogue emphasizing the expertise of the coaches and the efficacy of their
coaching.</p>
    </sec>
    <sec id="sec-2">
      <title>2.3 Questionnaire</title>
      <p>We used the Godspeed questionnaire to measure the Animacy, Anthropomorphism, Likeability and Perceived
Intelligence of our coaching agents. We did not use the Perceived Safety questionnaire, as we did not expect the
discussed subject to have a strong emotional impact with regard to anxiety, agitation, or surprise. Further, we modified
the System Usability Scale questionnaire to suit our study. We removed the questions related to use of the product
as a new technology, since we used a simple computer interface. However, we retained some questions
related to ease of use. Finally, we asked two general open-ended questions about the participant’s opinion of the
system and the agents to capture the overall impression, and to find out if they would recommend the system to others.</p>
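      <p>As an illustration of how such questionnaire ratings could be aggregated, the following minimal sketch averages 5-point item ratings within each Godspeed subscale and then across participants. The function names and the data layout are assumptions made for this example, not the analysis code used in the study.</p>
      <preformat><![CDATA[
```python
from statistics import mean

# Godspeed subscales used in the study (Perceived Safety was not administered).
SUBSCALES = ("anthropomorphism", "animacy", "likeability", "perceived_intelligence")


def subscale_scores(responses):
    """Average the item ratings of one participant within each subscale.

    `responses` maps a subscale name to that participant's list of
    item ratings on a 1-5 scale (hypothetical data layout).
    """
    scores = {}
    for name in SUBSCALES:
        items = responses[name]
        if not all(1 <= rating <= 5 for rating in items):
            raise ValueError(f"rating outside 1-5 in subscale {name!r}")
        scores[name] = mean(items)
    return scores


def group_means(all_responses):
    """Average each subscale score over all participants."""
    per_participant = [subscale_scores(r) for r in all_responses]
    return {name: mean(p[name] for p in per_participant) for name in SUBSCALES}
```
]]></preformat>
      <p>Averaging within a participant first gives every participant equal weight in the group mean, even if some items were skipped.</p>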
    </sec>
    <sec id="sec-3">
      <title>2.4 Procedure</title>
      <p>The experiment involved one user interaction that lasted approximately five minutes and was conducted
in English. The scenario involved four coaches, of which two were ASAP agents and two were Greta agents. Every
participant interacted with the system individually. The participants were first asked to read the information letter
and sign the informed consent form, and were invited to ask any questions they had. They were then given a brief
description of the experiment and the tasks to be performed. During the interaction, the virtual coaches introduced
themselves and provided an interactive coaching session on healthy weight management. After the interaction, the
participants were asked to answer the questionnaire.</p>
    </sec>
    <sec id="sec-4">
      <title>2.5 Participants</title>
      <p>The prototype was set up at Sorbonne University and the University of Twente, where we collected responses from 7 and 13
participants respectively. In total we had 20 participants, of whom 40% were female (n=8) and 60% were male (n=12). A
total of 30% of the participants were below 55 years old, while 70% were aged 55 years or older and thus fell within the
primary target population for Council of Coaches. A total of 75% of the participants had never interacted with a virtual
agent prior to this study.</p>
    </sec>
    <sec id="sec-5">
      <title>2.6 Results and Discussion</title>
      <p>The results indicate that participants rated the agents high on likeability (m = 3.67) and perceived intelligence (m =
3.63), but lower on anthropomorphism (m = 2.76) and animacy (m = 2.93). Since 70% of the participants
were aged 55 years or older, their expectations of the agents may have been technically unrealistic. Furthermore, we need to
consider that 75% of them had never interacted with a virtual agent before, so their perception was probably
influenced by media (films).</p>
      <p>[Figure panels: (a) Study 1; (b) Study 2]</p>
      <p>The participants’ opinions about the system and agents were mixed. While several participants mentioned that they
enjoyed using the system and felt comfortable, a few remarked that the gaze behaviour of the agents was not genuine.
This could be related to the remarks about the interaction not being human enough yet. We therefore aim to develop a
model to improve the gaze behaviour of the agents during the interaction.</p>
      <p>The agents were perceived to be particularly friendly (m = 4.05) and kind (m = 3.85). 70% of the participants
thought that the system was well integrated, 55% indicated that they would use the system frequently,
and 50% felt confident using the system. However, 35% of the participants reported that they would need technical
assistance. Even though the interface was simple (clicking buttons), the overall system might have looked complicated
to the older participants. Had it only involved launching an app or a browser, users might not have felt that they required
assistance. Overall, 80% of the participants said that they would recommend the system to their friends and family.</p>
    </sec>
    <sec id="sec-6">
      <title>3 EVALUATION STUDY 2: COHESIVE GROUP MODEL</title>
      <p>
        Group cohesion is prominent when the main goal of the group is decision making or problem solving. Cohesion
describes the shared bond or attraction that drives group members to stay together and to want
to work together [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It is a group phenomenon that emerges over time in teams [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and a key variable for effective
team performance [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Hence, we believe cohesion is an important phenomenon for our multiparty conversation model.
In this evaluation study we aim to understand how participants perceive the cohesive behaviours of the virtual coaches.
In particular, we are interested in evaluating the perceived level of cohesiveness of the group, the trust in
the agents and their persuasiveness.
      </p>
    </sec>
    <sec id="sec-7">
      <title>3.1 System</title>
      <p>
        The main aim of a group behaviour model is to generate the non-verbal behaviours for all the agents present in the
discussion based on their roles, i.e., listener or speaker, as the conversation proceeds. In this work, the goal of this
component is to enable the agents to display cohesive group behaviour. In order to model group cohesion in multi-party
interactions, we first annotated non-verbal behaviours (gaze behaviour, facial expressions, head movement and
laughter) and the perceived cohesion level on 2-min video segments of the Patient Consultation Corpus (PCC) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The
annotated segments were grouped as either low or high cohesion based on a threshold. A one-way ANOVA revealed that
mutual gaze, laughter and head nods are prominent cues frequently observed in high cohesion segments. Moreover, gaze
and laughter (smile) information was used to automatically recognize cohesive video segments, reaching an accuracy
higher than 75%. We can infer that these two cues play a very important role in the perception of cohesion by external
observers. For this study, we use head nods in addition to these two behaviours to display a cohesive group of
agents. We use an LSTM network, implemented in Keras with optimised hyperparameters, to predict speaker and listener
behaviours. The input to the model is the one-hot encoded gaze direction for each participant along with the binary
encoding of smile and head nods. The output is translated into a BML file to be executed by each agent. The
model is trained on the cohesive video segments only. The model generates a gaze target every 30 frames, and the
corresponding BML file is selected. The network also signals when an agent has to display a smile or a head nod.
      </p>
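      <p>The input encoding and the gaze output described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the gaze-target vocabulary, the function names and the identifiers are hypothetical, and the LSTM itself is omitted. The generated element is modelled on the BML 1.0 gaze behaviour without its namespace; the exact BML consumed by the Greta and ASAP realizers may differ.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical gaze targets for a four-coach scene plus the user.
GAZE_TARGETS = ["Francois", "Olivia", "Emma", "Carlos", "User"]
FRAMES_PER_GAZE_UPDATE = 30  # the model emits a gaze target every 30 frames


def encode_participant(gaze_target, smiling, nodding):
    """One-hot gaze direction followed by binary smile and head-nod flags."""
    one_hot = [1 if t == gaze_target else 0 for t in GAZE_TARGETS]
    if sum(one_hot) != 1:
        raise ValueError(f"unknown gaze target: {gaze_target!r}")
    return one_hot + [int(smiling), int(nodding)]


def encode_frame(participants):
    """Concatenate the encodings of all participants into one input vector."""
    vector = []
    for gaze_target, smiling, nodding in participants:
        vector.extend(encode_participant(gaze_target, smiling, nodding))
    return vector


def is_gaze_update_frame(frame_index):
    """True on the frames where a new gaze target is produced."""
    return frame_index % FRAMES_PER_GAZE_UPDATE == 0


def gaze_bml(agent, target, block_id="bml1"):
    """Build a small BML-style block asking `agent` to gaze at `target`."""
    bml = ET.Element("bml", {"id": block_id, "characterId": agent})
    ET.SubElement(bml, "gaze", {"id": "gaze1", "target": target, "start": "0"})
    return ET.tostring(bml, encoding="unicode")
```
]]></preformat>
      <p>With five possible gaze targets, each participant contributes seven values per frame (five for gaze, one for smile, one for head nod), so the input width grows linearly with the number of participants.</p>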
    </sec>
    <sec id="sec-8">
      <title>3.2 Stimulus</title>
      <p>
        The scenario consists of the same four coaches (virtual agents; see Sec. 2.2) interacting with each other, but with a different
appearance. Three different dialogue samples were developed on weight loss, stress management and sleep [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For this
evaluation study we chose only one topic, i.e., stress management, based on Bales’ model.
      </p>
    </sec>
    <sec id="sec-9">
      <title>3.3 Questionnaire</title>
      <p>We use two pre-study questionnaires and three post-study questionnaires. The pre-study questionnaires
are the Negative Attitude towards Robots Scale (NARS) adapted to virtual agents (4 items) and a measure of the persuadability
of the user (5 items). The post-study questionnaires measure the cohesiveness (4 items), the credibility (3 items) and the
persuasiveness (3 items) of the group.</p>
    </sec>
    <sec id="sec-10">
      <title>3.4 Design</title>
      <p>
        The goal of the study was to understand the impact of a cohesive group of agents. We designed a between-group
study with two groups. The first group interacted with agents that displayed behaviours generated by a random behaviour
generator. The second group interacted with agents that displayed cohesive group behaviours generated by our model. The
behaviours we focused on are gaze, smile and head nods. The agent appearance and dialogue content remained the same
for both groups. Based on the results from our previous study on persuasiveness, we assigned the role of providing
advice to an older authoritative agent, while the supportive role was assigned to a younger peer coach [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We also
made use of vicarious persuasion techniques, where one agent presents an argument to persuade another agent while
indirectly persuading the user. The dialogue on stress management lasted for about 3 minutes.
      </p>
      <p>The study was initially planned to take place in a laboratory setting. However, owing to the recent health regulations
on social distancing, we modified the experimental setup. Since the technical setup cannot be deployed on the fly
and requires several software platforms to be installed along with their licences, asking the participants (aged
above 50) to install it would have been challenging. Therefore, we modified our setup to present pre-recorded videos to the
participant to imitate real-time interaction. We used a survey platform to generate the flow of the conversation,
offer options for the user’s response and, based on the response, select the next video to be played.</p>
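      <p>The branching logic of such a video-based interaction can be sketched as a small lookup table. The clip names, response options and function names below are hypothetical; the study used a survey platform for this step rather than custom code.</p>
      <preformat><![CDATA[
```python
# Each clip maps the user's selectable responses to the next clip.
# An empty mapping marks the end of the interaction.
VIDEO_FLOW = {
    "intro.mp4": {
        "I often feel stressed": "stress_advice.mp4",
        "I rarely feel stressed": "stress_prevention.mp4",
    },
    "stress_advice.mp4": {"Tell me more": "closing.mp4"},
    "stress_prevention.mp4": {"Tell me more": "closing.mp4"},
    "closing.mp4": {},
}


def next_video(current, response):
    """Return the clip to play after `current` for the chosen response,
    or None when the interaction is over."""
    options = VIDEO_FLOW[current]
    if not options:
        return None
    if response not in options:
        raise ValueError(f"{response!r} is not an option after {current!r}")
    return options[response]


def play_through(start, responses):
    """Follow a list of responses from `start` and return the clips shown."""
    shown = [start]
    current = start
    for response in responses:
        current = next_video(current, response)
        if current is None:
            break
        shown.append(current)
    return shown
```
]]></preformat>
      <p>Because every branch is pre-recorded, the table fully determines what a participant can see, which keeps the two experimental conditions comparable.</p>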
    </sec>
    <sec id="sec-11">
      <title>3.5 Procedure</title>
      <p>The participants read a general instruction form and provided their consent to take part in the study. They then
filled in the pre-study questionnaires. An introductory video of a virtual agent was presented to familiarise the user
with the virtual agents, their behavioural capabilities and the type of interaction. We then started the session, in which the
user was asked to imagine a situation and then interact with the group of agents. The user was presented with a recording
of the interaction and prompted for a response when required. Once the user selected an option, the video interaction
continued. Once the interaction was complete, the user was notified, the post-study questionnaires were displayed, and we
collected basic demographic information.</p>
    </sec>
    <sec id="sec-12">
      <title>3.6 Participants</title>
      <p>The participants were recruited online through Prolific, a survey hosting platform. We set three criteria
for recruitment: aged above 50, proficient in English, and diagnosed with a chronic disease. In total,
32 participants took part in our evaluation study, of whom 10 were in the 51-60 age group and 22
were above 60. 36% of the participants were male and 64% were female.</p>
    </sec>
    <sec id="sec-13">
      <title>3.7 Results and Discussion</title>
      <p>A one-way ANOVA was used for the analysis of the responses. Across all participants (n=30), the perceived level of
cohesion was slightly higher for the condition using our model than for the random behaviour model. However,
the difference was not statistically significant (p &gt; 0.05). We calculated the persuadability score of each participant
and retained those with a score higher than three. In total, 16 participants, equally distributed between the two
conditions, were classed as persuadable. For these participants, the perceived level of cohesion was higher for the
videos generated by our model (m=4.03) than for the random behaviour model (m=3.53), and the difference was marginally
significant (p=0.1). There was no statistically significant difference between the two conditions for the perceived level
of trust. We also computed the mean trust score for only the persuadable participants in both conditions. Even though the
rating was higher for the condition using our model, the difference was not statistically significant. We further grouped
the participants based on the NARS questionnaire and did not find any significant results. Finally, the perceived level of
persuasiveness was rated equally in both conditions.</p>
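      <p>The analysis steps above can be sketched as follows: keep the participants whose persuadability score exceeds three, then compute the one-way ANOVA F statistic across conditions. This is a minimal, dependency-free sketch under assumed data structures; the dictionary keys and function names are hypothetical and this is not the authors’ analysis script. Turning F into a p-value additionally requires the F distribution, which is not part of the Python standard library.</p>
      <preformat><![CDATA[
```python
from statistics import mean


def persuadable(participants, threshold=3.0):
    """Keep participants whose persuadability score is above `threshold`.

    Each participant is a dict with the hypothetical keys
    'persuadability', 'condition' and 'cohesion'.
    """
    return [p for p in participants if p["persuadability"] > threshold]


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of rating lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more ratings than groups")
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the ratings around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    if ss_within == 0:
        raise ValueError("no within-group variance; F is undefined")
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```
]]></preformat>
      <p>For two conditions, <monospace>one_way_anova_f</monospace> is called with one list of cohesion ratings per condition and returns F with 1 and N - 2 degrees of freedom, where N is the number of retained participants.</p>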
      <p>In this evaluation study we tried to measure the perceived level of cohesion and how this in turn affects the trust in
the agents and their persuasiveness. To do this, we designed an online evaluation study with two conditions.
We used our model to generate cohesive behaviours for one condition, and for the other we used a random behaviour
model. Across all participants, we found no significant difference in the perceived level of cohesion between the two
conditions. However, when we retained only participants with a high persuadability score, we found a marginally
significant difference, with participants rating the condition using our model as a more cohesive group of virtual
agents. The study had to be run online with pre-recorded videos, which limited the quality of the videos. Even though
we tried our best to record high-quality videos, we cannot be sure that the participants were able to watch them in
the same setting. Since the difference in a listener executing a smile or a nod is very subtle, the participants might have
missed it. The environmental conditions, which we were not able to control, could also have affected the results. Regarding
perceived persuasiveness, we found no significant difference. This could be attributed to the fact that we
used the same dialogue content and agents for both conditions and only the non-verbal behaviours were different.
Some participants found the automatic text-to-speech audio to be very artificial, which could have affected
their rating. Overall, the participants found the study to be quite interesting and an enjoyable experience.</p>
      <p>We described two evaluation studies conducted in the context of the Council of Coaches project. The first study focused
on the evaluation of the technical prototype. The results indicate that participants rated the agents high on likeability
and perceived intelligence; however, the gaze model for group interaction of the agents needs to be developed. Further,
the users enjoyed the interaction and said they would recommend the system to others. Using the feedback received
from this study, we developed a group behaviour model that handles cohesive non-verbal behaviour generation for
the group of agents, with a focus on gaze, smile and head nods. Results indicated that the perceived level of cohesion
was slightly higher with our model than with the random behaviour model. Overall, the participants found the study to be quite interesting
and an enjoyable experience. In future, we aim to expand the cohesive behaviour generation model to include other
non-verbal behaviours.</p>
    </sec>
    <sec id="sec-14">
      <title>ACKNOWLEDGMENTS</title>
      <p>This project has received funding from the European Union’s Horizon 2020 research and innovation program under
grant agreement number 769553. This result only reflects the authors’ views and the EU is not responsible for any use
that may be made of the information it contains. We are grateful to our project collaborators at the University of Dundee,
the University of Twente and Universitat Politècnica de València, who were mainly responsible for the development of the
technical prototype.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Daniel J</given-names> <surname>Beal</surname></string-name>,
          <string-name><given-names>Robin R</given-names> <surname>Cohen</surname></string-name>,
          <string-name><given-names>Michael J</given-names> <surname>Burke</surname></string-name>, and
          <string-name><given-names>Christy L</given-names> <surname>McLendon</surname></string-name>.
          <year>2003</year>.
          <article-title>Cohesion and performance in groups: a meta-analytic clarification of construct relations</article-title>.
          <source>Journal of Applied Psychology</source>
          <volume>88</volume>,
          <issue>6</issue>
          (<year>2003</year>),
          <fpage>989</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Tessa</given-names> <surname>Beinema</surname></string-name>,
          <string-name><given-names>Harm</given-names> <surname>op den Akker</surname></string-name>,
          <string-name><given-names>Lex</given-names> <surname>van Velsen</surname></string-name>, and
          <string-name><given-names>Hermie</given-names> <surname>Hermens</surname></string-name>.
          <year>2021</year>.
          <article-title>Tailoring coaching strategies to users' motivation in a multi-agent health coaching application</article-title>.
          <source>Computers in Human Behavior</source>
          <volume>121</volume>
          (<year>2021</year>),
          <fpage>106787</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Milly</given-names> <surname>Casey-Campbell</surname></string-name> and
          <string-name><given-names>Martin L</given-names> <surname>Martens</surname></string-name>.
          <year>2009</year>.
          <article-title>Sticking it all together: A critical assessment of the group cohesion-performance literature</article-title>.
          <source>International Journal of Management Reviews</source>
          <volume>11</volume>,
          <issue>2</issue>
          (<year>2009</year>),
          <fpage>223</fpage>-<lpage>246</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Gerwin</given-names> <surname>Huizing</surname></string-name>,
          <string-name><given-names>Brice</given-names> <surname>Donval</surname></string-name>,
          <string-name><given-names>Mukesh</given-names> <surname>Barange</surname></string-name>,
          <string-name><given-names>Reshmashree</given-names> <surname>Kantharaju</surname></string-name>, and
          <string-name><given-names>Fajrian</given-names> <surname>Yunus</surname></string-name>.
          <year>2019</year>.
          <article-title>of deliverable Final prototype description and evaluations of the virtual coaches</article-title>.
          (<year>2019</year>).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Gerwin</given-names> <surname>Huizing</surname></string-name>,
          <string-name><given-names>Randy</given-names> <surname>Klaassen</surname></string-name>, and
          <string-name><given-names>Dirk</given-names> <surname>Heylen</surname></string-name>.
          <year>2021</year>.
          <article-title>Designing Effective Dialogue Content for a Virtual Coaching Team Using the Interaction Process Analysis and Interpersonal Circumplex Models</article-title>.
          In <source>Persuasive Technology</source>.
          <publisher-name>Springer International Publishing</publisher-name>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Reshmashree B</given-names> <surname>Kantharaju</surname></string-name>,
          <string-name><given-names>Dominic</given-names> <surname>De Franco</surname></string-name>,
          <string-name><given-names>Alison</given-names> <surname>Pease</surname></string-name>, and
          <string-name><given-names>Catherine</given-names> <surname>Pelachaud</surname></string-name>.
          <year>2018</year>.
          <article-title>Is two better than one? Effects of multiple agents on user persuasion</article-title>.
          In <source>Proceedings of the 18th International Conference on Intelligent Virtual Agents</source>.
          <fpage>255</fpage>-<lpage>262</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
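        <mixed-citation>
          [7]
          <string-name><given-names>Reshmashree B.</given-names> <surname>Kantharaju</surname></string-name>,
          <string-name><given-names>Alison</given-names> <surname>Pease</surname></string-name>,
          <string-name><given-names>Dennis</given-names> <surname>Reidsma</surname></string-name>,
          <string-name><given-names>Catherine</given-names> <surname>Pelachaud</surname></string-name>,
          <string-name><given-names>Mark</given-names> <surname>Snaith</surname></string-name>,
          <string-name><given-names>Merijn</given-names> <surname>Bruijnes</surname></string-name>,
          <string-name><given-names>Randy</given-names> <surname>Klaassen</surname></string-name>,
          <string-name><given-names>Tessa</given-names> <surname>Beinema</surname></string-name>,
          <string-name><given-names>Gerwin</given-names> <surname>Huizing</surname></string-name>,
          <string-name><given-names>Donatella</given-names> <surname>Simonetti</surname></string-name>,
          <string-name><given-names>Dirk</given-names> <surname>Heylen</surname></string-name>, and
          <string-name><given-names>Harm</given-names> <surname>op den Akker</surname></string-name>.
          <year>2019</year>.
          <article-title>Integrating Argumentation with Social Conversation between Multiple Virtual Coaches</article-title>.
          In <source>Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents (Paris, France) (IVA '19)</source>.
          <publisher-name>Association for Computing Machinery</publisher-name>,
          <publisher-loc>New York, NY, USA</publisher-loc>,
          <fpage>203</fpage>-<lpage>205</lpage>.
          https://doi.org/10.1145/3308532.3329450
        </mixed-citation>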
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>Jessica M</given-names> <surname>Santoro</surname></string-name>,
          <string-name><given-names>Aurora J</given-names> <surname>Dixon</surname></string-name>,
          <string-name><given-names>Chu-Hsiang</given-names> <surname>Chang</surname></string-name>, and
          <string-name><given-names>Steve WJ</given-names> <surname>Kozlowski</surname></string-name>.
          <year>2015</year>.
          <article-title>Measuring and monitoring the dynamics of team cohesion: Methods, emerging tools, and advanced technologies</article-title>.
          In <source>Team cohesion: Advances in psychological theory, methods and practice</source>.
          <publisher-name>Emerald Group Publishing Limited</publisher-name>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
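        <mixed-citation>
          [9]
          <string-name><given-names>Mark</given-names> <surname>Snaith</surname></string-name>,
          <string-name><given-names>Nicholas</given-names> <surname>Conway</surname></string-name>,
          <string-name><given-names>Tessa</given-names> <surname>Beinema</surname></string-name>,
          <string-name><given-names>Dominic</given-names> <surname>De Franco</surname></string-name>,
          <string-name><given-names>Alison</given-names> <surname>Pease</surname></string-name>,
          <string-name><given-names>Reshmashree</given-names> <surname>Kantharaju</surname></string-name>,
          <string-name><given-names>Mathilde</given-names> <surname>Janier</surname></string-name>,
          <string-name><given-names>Gerwin</given-names> <surname>Huizing</surname></string-name>,
          <string-name><given-names>Catherine</given-names> <surname>Pelachaud</surname></string-name>, and
          <string-name><given-names>Harm</given-names> <surname>op den Akker</surname></string-name>.
          <year>2021</year>.
          <article-title>A multimodal corpus of simulated consultations between a patient and multiple healthcare professionals</article-title>.
          <source>Language Resources and Evaluation</source>
          (<year>2021</year>),
          <fpage>1</fpage>-<lpage>16</lpage>.
        </mixed-citation>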
      </ref>
    </ref-list>
  </back>
</article>