<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>L. Grassi); carmine.recchiuto@dibris.unige.it (C. T. Recchiuto);
antonio.sgorbissa@unige.it (A. Sgorbissa)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multiparty Verbal Interaction Between Humans and Artificial Agents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lucrezia Grassi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carmine Tommaso Recchiuto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Sgorbissa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Genoa</institution>
          ,
          <addr-line>Via All'Opera Pia 13, 16145, Genoa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>The study of verbal interaction between multiple humans and robots is an almost unexplored research field. This kind of interaction has been primarily analyzed in the literature focusing on cooperation to achieve a common task or on more technical aspects such as active speaker recognition. The presented work proposes a holistic approach to solve the problem: a cloud architecture that allows social robots and artificial agents to interact verbally with a group of people. The system can recognize the active speaker and decide whom to address based on the developed policies while also correctly keeping track of the conversation state.</p>
      </abstract>
      <kwd-group>
        <kwd>Autonomous Conversation</kwd>
        <kwd>Multiparty Interaction</kwd>
        <kwd>Human-Robot Interaction</kwd>
        <kwd>Social Robotics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Social Robotics aims to develop robots that can provide physical and cognitive support in
a socially interactive way. During the interaction, one of the main issues is the knowledge
acquisition problem. To solve this problem, the system should have the capability of learning
through interaction. The agent should recognize new relevant information, update its knowledge
appropriately, and use new information to adapt its behavior when interacting with the user.
This problem, which is already very challenging, is made more complex when a social agent
communicates with multiple people simultaneously. In this context, the robot should not
only be able to acquire knowledge but also to recognize its interlocutors to correctly associate
relevant information with the person it relates to. Moreover, such a system should emulate the
conversation patterns that typically emerge when several humans interact with each other [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
a problem that has been almost entirely ignored in the Social Robotics literature, where human-robot
interaction is typically one-on-one. Among the problems to be addressed, the system must
keep track of the conversation state with different users and recognize who is talking to
provide the most appropriate response [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Currently, there are few robots capable of autonomously interacting with multiple users at
the same time, although this type of interaction frequently occurs for humans. In a multiparty
spoken dialogue system, such as the one described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the agent can discriminate between
multiple users using the information provided by a Kinect. However, the agent’s conversational
capabilities are very basic and limited, and long-term engaging and natural conversation between
multiple parties is still an open problem. In addition, the agent is not able to engage multiple
users simultaneously in the conversation, but only one at a time. The tracking and fusion
aspects of multiparty interactions with artificial agents are studied in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where a system with
a life-size virtual agent and a social robot is introduced. The system focuses only on a user
entry/exit mechanism with re-identification of users, but not on the conversation, and can
currently keep accurate track of only two users. The literature suggests that voice plays an
important role in determining who is talking: several techniques for
speaker recognition have already been studied [5], and some approaches work even in noisy and
unconstrained conditions [6].
      </p>
      <p>
        On the other hand, the dynamics of group conversations have been examined in depth by
researchers in the field of psychology. Several studies have found that increasing the number of
people in a conversation creates systematic challenges for speakers and listeners, a phenomenon
that is called “the many minds problem” [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Specifically, when more people interact with each
other, the basic mechanisms of a conversation are altered, such as “turn-taking” (i.e., a type of
organization in conversation in which participants speak one at a time in alternating turns),
“floor-time” or “air-time” (i.e., the time participants use to speak), and the type of feedback
listeners provide to the speaker.
      </p>
      <p>The presented work:</p>
      <list list-type="bullet">
        <list-item>
          <p>introduces a software architecture to empower a robot with the capability of recognizing
the users participating in a conversation;</p>
        </list-item>
        <list-item>
          <p>implements different strategies to control the dynamics of a group conversation, deciding
which speaker to address based on data gathered during the interaction.</p>
        </list-item>
      </list>
      <p>Section 2 briefly describes the architecture of the cloud system. Section 3 presents the results
of the experiments performed to assess the performance of the system in terms of average
response time.</p>
    </sec>
    <sec id="sec-2">
      <title>2. A Cloud Architecture for Multiparty Verbal Interaction</title>
      <p>The cloud system for multiparty interaction has been built on CAIR, a cloud
software architecture for autonomous conversation [7], [8], [9]. In brief, CAIR is
composed of two web services: the Dialogue Manager service, which manages the dialogue and
analyzes the user sentence to recognize the intention of talking about a specific topic, and the
Plan Manager service, which recognizes the intention of the user to make the agent execute
a specific action. To provide appropriate answers and plans, the server exploits an Ontology
containing all the topics, keywords, sentences, and plans used during the interaction with the
user, as described in [10], [11]. These components are shown in Figure 1, embedded in
the server. A client (i.e., the software controlling the robot) can send requests to the server
through REST APIs, providing the sentence pronounced by the user along
with information about the status of the conversation [7].</p>
      <p>The “red” elements in Figure 1 are the ones that enable effective multiparty interaction. All
requests that arrive at the cloud are handled by the Hub service, which forwards
them to the Dialogue Manager and Plan Manager services. Moreover, the client has been
extended with two new services: the Registration service and the Audio Recorder service. The Registration
service is called every time a registration procedure is started. Suppose a new user wants to
be recognized by the system: they simply trigger the registration procedure, at any
moment of the conversation, by saying a sentence such as “Registration” or “Learn my voice”.
The system then associates a new profile ID with the user, who is asked to provide their
name and gender and to talk for 20 seconds to complete the enrolment. The Audio Recorder
service starts acquiring audio when the Root Mean Square (RMS) level of the captured sound exceeds a certain
threshold. This service sends the audio pieces to the Speech Recognition API to obtain the
transcription, and to the Speaker Recognition API to obtain the ID of the corresponding speaker
(if registered). After a final silence exceeding a certain threshold, the service returns an XML
string containing the transcribed pieces of text, each tagged with the ID of the corresponding
speaker. Finally, the client sends the string to the Hub, along with the client state. Note that
the exchange of messages between the client and the Hub proceeds until one of the
users decides to terminate the interaction by saying a predefined sentence such as “Goodbye”
or “Disconnect”.</p>
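      <p>The behaviour of the Audio Recorder service can be illustrated with a minimal sketch. It assumes that audio arrives as fixed-length frames of float samples; the function names, the threshold parameters, and the XML element and attribute names (<monospace>utterance</monospace>, <monospace>piece</monospace>, <monospace>speaker</monospace>) are hypothetical and are not taken from the actual implementation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import xml.etree.ElementTree as ET
from typing import Iterable, List, Optional, Sequence, Tuple

Frame = Sequence[float]


def rms(frame: Frame) -> float:
    """Root mean square of one audio frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def segment_utterances(frames: Iterable[Frame], frame_seconds: float,
                       start_threshold: float,
                       max_silence_seconds: float) -> List[List[Frame]]:
    """Group frames into recordings (hypothetical helper).

    A recording starts when a frame's RMS exceeds start_threshold and
    ends once the trailing silence lasts longer than max_silence_seconds.
    """
    recordings: List[List[Frame]] = []
    current: List[Frame] = []
    silence = 0.0
    for frame in frames:
        loud = rms(frame) > start_threshold
        if not current:
            if loud:  # the RMS trigger opens a new recording
                current.append(frame)
                silence = 0.0
            continue
        current.append(frame)
        silence = 0.0 if loud else silence + frame_seconds
        if silence > max_silence_seconds:  # final silence closes it
            recordings.append(current)
            current, silence = [], 0.0
    if current:
        recordings.append(current)
    return recordings


def tagged_xml(pieces: Iterable[Tuple[Optional[str], str]]) -> str:
    """Serialize (speaker_id, text) pieces; tag names are assumptions."""
    root = ET.Element("utterance")
    for speaker_id, text in pieces:
        piece = ET.SubElement(root, "piece",
                              {"speaker": speaker_id or "unknown"})
        piece.text = text
    return ET.tostring(root, encoding="unicode")
```
]]></preformat>
      <p>In this sketch an unregistered speaker is tagged as “unknown”; how the actual service labels such pieces is not specified here.</p>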
      <p>The client state has also been extended: it now contains statistics related to the speakers,
such as a matrix containing the probability that a speaker talks after another, a matrix with the
number of times a speaker talked after another in the same or the following turn, the total number
of turns of each user, the average topic distance between speakers, the a priori probability
that a speaker talks, a moving window keeping track of the turns, and other information. The
moving window, stored on the client device, is a fundamental element of the state: it contains
information about the conversation turns of the last <italic>n</italic> “active” minutes. For each turn, the
moving window stores the ID of the speaker, the speaking time, and the number of words said.
If the sum of the speaking times of the turns in the moving window exceeds <italic>n</italic>, the oldest turn is
removed as the latest one is added (FIFO queue).</p>
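      <p>The moving window can be sketched as a FIFO queue bounded by total speaking time. The class and field names below are hypothetical, times are expressed in seconds for convenience, and the sketch assumes that old turns are evicted repeatedly until the budget is respected, which is one possible reading of the rule described above.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import deque
from dataclasses import dataclass
from typing import Deque


@dataclass
class Turn:
    speaker_id: str
    speaking_time: float  # seconds of speech in this turn
    n_words: int


class MovingWindow:
    """Turns covering (at most) the last max_active_seconds of speech."""

    def __init__(self, max_active_seconds: float) -> None:
        self.max_active_seconds = max_active_seconds
        self.turns: Deque[Turn] = deque()

    def total_speaking_time(self) -> float:
        return sum(t.speaking_time for t in self.turns)

    def add(self, turn: Turn) -> None:
        """Append the latest turn, then drop the oldest ones if needed."""
        self.turns.append(turn)
        # Assumption: always keep the newest turn, even if it alone
        # exceeds the budget.
        while (len(self.turns) > 1
               and self.total_speaking_time() > self.max_active_seconds):
            self.turns.popleft()
```
]]></preformat>
      <p>Only “active” time (time in which someone is speaking) is counted, so pauses in the conversation do not shrink the window.</p>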
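      <p>The information contained in the moving window has been used to develop two control
policies, based on the analysis of group dynamics: the “dominant” policy and the “submissive”
policy. The policies are implemented as functions that take as input the data contained in the
moving window and output the speaker to address. The first policy recognizes and addresses
the dominant user among the group of people interacting with the robot, while the second
one recognizes and addresses the user who participates less in the conversation (submissive).
Participation in the conversation is measured through a weight <inline-formula><tex-math>W_i</tex-math></inline-formula> that accounts for both
speaking time and number of words, as they turned out to be the most relevant indicators to
detect dominance [12]. To compute <inline-formula><tex-math>W_i</tex-math></inline-formula> for each speaker <inline-formula><tex-math>i</tex-math></inline-formula>, the percentage of their speaking
time (<inline-formula><tex-math>T_i</tex-math></inline-formula>) and the number of words (<inline-formula><tex-math>N_i</tex-math></inline-formula>) in the moving window should first be measured. Then,
<inline-formula><tex-math>W_i</tex-math></inline-formula> is computed as:</p>
      <disp-formula id="eq1">
        <label>(1)</label>
        <tex-math>W_i = K_1 T_i + K_2 N_i,</tex-math>
      </disp-formula>
      <p>where <inline-formula><tex-math>K_1</tex-math></inline-formula> and <inline-formula><tex-math>K_2</tex-math></inline-formula> are two gains that indicate how much importance is given to the speaking
time and to the number of words when determining the dominance. The addressed speaker
when applying the “dominant” policy will be:</p>
      <disp-formula id="eq2">
        <label>(2)</label>
        <tex-math>i_{\mathrm{dominant}} = \arg\max_{i} \, (W_i),</tex-math>
      </disp-formula>
      <p>while the addressed speaker when applying the “submissive” policy will be:</p>
      <disp-formula id="eq3">
        <label>(3)</label>
        <tex-math>i_{\mathrm{submissive}} = \arg\min_{i} \, (W_i).</tex-math>
      </disp-formula>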
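      <p>The two policies can be illustrated with a short sketch. It assumes that both the speaking time and the number of words of each speaker are expressed as shares of the window totals, and that the weight is a gain-weighted sum of the two shares; the function names and the default gains <monospace>k1</monospace> and <monospace>k2</monospace> are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (speaker_id, speaking_time_seconds, n_words) for one turn in the window
Turn = Tuple[str, float, int]


def participation_weights(turns: Iterable[Turn],
                          k1: float = 0.5,
                          k2: float = 0.5) -> Dict[str, float]:
    """Weight per speaker: k1 * time share + k2 * word share."""
    time_by: Dict[str, float] = defaultdict(float)
    words_by: Dict[str, int] = defaultdict(int)
    for speaker, seconds, words in turns:
        time_by[speaker] += seconds
        words_by[speaker] += words
    total_time = sum(time_by.values())
    total_words = sum(words_by.values())
    weights: Dict[str, float] = {}
    for speaker in time_by:
        time_share = time_by[speaker] / total_time if total_time else 0.0
        word_share = words_by[speaker] / total_words if total_words else 0.0
        weights[speaker] = k1 * time_share + k2 * word_share
    return weights


def dominant_speaker(weights: Dict[str, float]) -> str:
    """Speaker with the largest weight ("dominant" policy)."""
    return max(weights, key=weights.get)


def submissive_speaker(weights: Dict[str, float]) -> str:
    """Speaker with the smallest weight ("submissive" policy)."""
    return min(weights, key=weights.get)
```
]]></preformat>
      <p>Note that, in this sketch, a registered user who never spoke within the window does not appear in the weights; a real system may prefer to give such users a weight of zero so that the “submissive” policy can address them.</p>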
      <p>The third policy developed is the “community” policy. This policy is based on the idea
that it is possible to identify sub-groups (i.e., communities) among the people in a group. To
identify the communities, we use a matrix containing the probability that a speaker talks after
another. Such a matrix is transformed into an undirected graph where the nodes represent the
speakers, and the probabilities are the weights of the edges. The Louvain algorithm is then
applied to the weighted graph to obtain the best partition of the nodes into communities [13].
The algorithm starts from a singleton partition in which each node is in its own community. Then,
it moves individual nodes from one community to another to find a partition of higher quality (modularity). Based on this
partition, the algorithm creates an aggregate network and moves individual nodes in such a
network. These steps are repeated until the quality cannot be further increased. Once the best
partition has been obtained, the result of the algorithm is used by the policy to address a random
speaker of a different community at every turn. The policy aims to control the conversation by
keeping all speakers in a single community and avoiding having speakers divided into sub-groups.
Note that the policy to be used by the system can be chosen before starting the
interaction.</p>
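      <p>The graph construction and the addressing step of the “community” policy can be sketched as follows. The sketch does not implement the Louvain algorithm itself: the partition is assumed to come from an existing implementation (for instance, the <monospace>louvain_communities</monospace> function in NetworkX). Two further assumptions are made: the directed “talks-after” probabilities are symmetrised by summing both directions, and “a different community” is read as a community other than that of the most recent speaker. All function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from typing import Dict, List, Optional, Sequence, Set, Tuple


def undirected_weights(prob: Sequence[Sequence[float]],
                       speakers: Sequence[str]
                       ) -> Dict[Tuple[str, str], float]:
    """Turn the 'talks-after' matrix into undirected edge weights.

    Assumption: the weight of edge {a, b} is P(b after a) + P(a after b).
    """
    edges: Dict[Tuple[str, str], float] = {}
    for i, a in enumerate(speakers):
        for j in range(i + 1, len(speakers)):
            weight = prob[i][j] + prob[j][i]
            if weight > 0:
                edges[(a, speakers[j])] = weight
    return edges


def pick_from_other_community(communities: List[Set[str]],
                              last_speaker: str,
                              rng: Optional[random.Random] = None
                              ) -> Optional[str]:
    """Pick a random speaker outside the last speaker's community.

    Returns None when everybody already belongs to one community.
    """
    rng = rng or random.Random()
    candidates = [s for community in communities
                  if last_speaker not in community
                  for s in sorted(community)]
    if not candidates:
        return None
    return rng.choice(candidates)
```
]]></preformat>
      <p>When the function returns <monospace>None</monospace>, the group already forms a single community, and the policy has no sub-group boundary to bridge.</p>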
    </sec>
    <sec id="sec-3">
      <title>3. Preliminary Results</title>
      <p>An example of the interaction of multiple users with the described system can be seen in
this video: <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=TpCGqFZLN4k">https://www.youtube.com/watch?v=TpCGqFZLN4k</ext-link>. Moreover, an experimental protocol to assess the impact of the developed policies
has already been defined and approved by the ethics committee. Participants will be divided into
four groups (a control group and three experimental groups) and will take part
in a conversation with the robot. During the experiments, the robot will assume the role of a
“moderator” applying the developed policies. Data gathered during the experiments will allow
us to determine how the different policies impact participants’ perceptions of the robot and the
overall quality of the conversation.</p>
      <p>Preliminary tests have also been carried out to assess the capability of the system to deal
with multiple client devices simultaneously. In particular, we performed a Baseline test to
evaluate the performance of the system in terms of average response time (i.e., the difference
between the time when the request was sent by the client and the time when the response was
fully received). To do this, we considered four different payloads, each containing the client
sentence and a client dialogue state of a different size (from empty to full), to understand the
impact of the request data size on the response time. For each of these scenarios, 30 requests
spaced five seconds apart were performed by a single thread/client. The results reported in
Figure 2 show that, even with the maximum payload, the average response time remains
low (within 200 ms).</p>
      <p>As the objective is to empower a variety of devices with the ability to hold a long-term
conversation with one or more users, it is fundamental that the system can manage
concurrent connections from a growing number of clients. For this reason, we also performed a
Scalability test to assess how the average response time increases with a growing number of
requests. The test was carried out by simulating an increasing number N of users performing
requests simultaneously, using the largest request payload. The threshold established for these
experiments is one second, which is below the delay reported in experiments on people’s
perception during a dialogue with a conversational system [14, 15]. Setting a threshold below that delay
leaves a margin for variations due to the load, the network performance, or the additional
time required to perform the speech-to-text transcription, which helps preserve user satisfaction
during the conversation. With this in mind, the results of the Scalability test, shown in Figure
3, revealed that the system can support up to 20 simultaneous requests without negatively
affecting the user’s perception.</p>
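      <p>The measurement procedure used in these tests can be approximated with a small harness. This is only a sketch under stated assumptions: <monospace>fake_request</monospace> is a stand-in for a real round trip to the server, clients are simulated with threads, and all names are hypothetical. With one client and a pause between requests it mimics the Baseline test; with several clients it mimics the Scalability test.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, List


def timed_call(send_request: Callable[[], None]) -> float:
    """Seconds elapsed between sending a request and its full response."""
    start = time.perf_counter()
    send_request()
    return time.perf_counter() - start


def average_response_time(send_request: Callable[[], None],
                          n_clients: int,
                          requests_per_client: int = 1,
                          pause_seconds: float = 0.0) -> float:
    """Average response time with n_clients issuing requests in parallel."""

    def client(_: int) -> List[float]:
        times = []
        for k in range(requests_per_client):
            times.append(timed_call(send_request))
            if pause_seconds and k < requests_per_client - 1:
                time.sleep(pause_seconds)
        return times

    # One thread per simulated client, started (almost) simultaneously.
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        per_client = list(pool.map(client, range(n_clients)))
    return mean(t for times in per_client for t in times)


def fake_request() -> None:
    """Placeholder for an HTTP request to the cloud server."""
    time.sleep(0.01)


if __name__ == "__main__":
    threshold = 1.0  # seconds, the threshold used in the Scalability test
    for n in (1, 5, 10, 20):
        avg = average_response_time(fake_request, n_clients=n)
        print(n, round(avg, 3), avg < threshold)
```
]]></preformat>
      <p>In a real test, <monospace>fake_request</monospace> would be replaced by a call that sends the largest payload to the server and waits for the complete response.</p>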
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>The paper presented the architecture of a cloud system allowing robots and other devices to
verbally interact with multiple people simultaneously. The work also presented and discussed
the results of experiments aimed at assessing the performance of the system in terms of average
response speed. These preliminary findings provided us with the basis to size the system,
paving the way to a sustainable solution for verbal interaction with low-cost robots and other
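intelligent devices.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] <string-name><given-names>G.</given-names> <surname>Cooney</surname></string-name>, <string-name><given-names>A. M.</given-names> <surname>Mastroianni</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Abi-Esber</surname></string-name>, <string-name><given-names>A. W.</given-names> <surname>Brooks</surname></string-name>, <article-title>The many minds problem: disclosure in dyadic versus group conversation</article-title>, <source>Current Opinion in Psychology</source> <volume>31</volume> (<year>2020</year>) <fpage>22</fpage>-<lpage>27</lpage>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] <string-name><given-names>T.</given-names> <surname>Kinnunen</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Li</surname></string-name>, <article-title>An overview of text-independent speaker recognition: From features to supervectors</article-title>, <source>Speech Communication</source> <volume>52</volume> (<year>2010</year>) <fpage>12</fpage>-<lpage>40</lpage>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] <string-name><given-names>A.</given-names> <surname>Pappu</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Sridharan</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Rudnicky</surname></string-name>, <article-title>Situated multiparty interaction between humans and agents</article-title>, in: International Conference on Human-Computer Interaction, Springer, <year>2013</year>, pp. <fpage>107</fpage>-<lpage>116</lpage>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] <string-name><given-names>Z.</given-names> <surname>Yumak</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Ren</surname></string-name>, <string-name><given-names>N. M.</given-names> <surname>Thalmann</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yuan</surname></string-name>, <article-title>Tracking and fusion for multiparty interaction with a virtual character and a social robot</article-title>, in: SIGGRAPH Asia 2014 Autonomous Virtual Humans and Social Robot for Telepresence, <year>2014</year>, pp. <fpage>1</fpage>-<lpage>7</lpage>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] <string-name><given-names>J. P.</given-names> <surname>Campbell</surname></string-name>, <article-title>Speaker recognition: A tutorial</article-title>, <source>Proceedings of the IEEE</source> <volume>85</volume> (<year>1997</year>) <fpage>1437</fpage>-<lpage>1462</lpage>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] <string-name><given-names>T.</given-names> <surname>Kinnunen</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Li</surname></string-name>, <article-title>An overview of text-independent speaker recognition: From features to supervectors</article-title>, <source>Speech Communication</source> <volume>52</volume> (<year>2010</year>) <fpage>12</fpage>-<lpage>40</lpage>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] <string-name><given-names>L.</given-names> <surname>Grassi</surname></string-name>, <string-name><given-names>C. T.</given-names> <surname>Recchiuto</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Sgorbissa</surname></string-name>, <article-title>Cloud services for social robots and artificial agents</article-title>, <source>The 8th Italian Workshop on Artificial Intelligence and Robotics - AIRO 2021</source> (<year>2021</year>).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] <string-name><given-names>L.</given-names> <surname>Grassi</surname></string-name>, <string-name><given-names>C. T.</given-names> <surname>Recchiuto</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Sgorbissa</surname></string-name>, <article-title>Sustainable verbal and non-verbal human-robot interaction through cloud services</article-title>, <source>arXiv preprint arXiv:2203.02606</source> (<year>2022</year>).</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] <string-name><given-names>C.</given-names> <surname>Recchiuto</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Gava</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Grassi</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Grillo</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Lagomarsino</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Lanza</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Papadopoulos</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Papadopoulos</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Scalmato</surname></string-name>, et al., <article-title>Cloud services for culture aware conversation: Socially assistive robots and virtual assistants</article-title>, in: 2020 17th International Conference on Ubiquitous Robots (UR), IEEE, <year>2020</year>, pp. <fpage>270</fpage>-<lpage>277</lpage>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] <string-name><given-names>C. T.</given-names> <surname>Recchiuto</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Sgorbissa</surname></string-name>, <article-title>A feasibility study of culture-aware cloud services for conversational robots</article-title>, <source>IEEE Robotics and Automation Letters</source> <volume>5</volume> (<year>2020</year>) <fpage>6559</fpage>-<lpage>6566</lpage>.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] <string-name><given-names>L.</given-names> <surname>Grassi</surname></string-name>, <string-name><given-names>C. T.</given-names> <surname>Recchiuto</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Sgorbissa</surname></string-name>, <article-title>Knowledge-grounded dialogue flow management for social robots and conversational agents</article-title>, <source>International Journal of Social Robotics</source> (<year>2022</year>) <fpage>1</fpage>-<lpage>21</lpage>.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] <string-name><given-names>M. S.</given-names> <surname>Mast</surname></string-name>, <article-title>Dominance as expressed and inferred through speaking time: A meta-analysis</article-title>, <source>Human Communication Research</source> <volume>28</volume> (<year>2002</year>) <fpage>420</fpage>-<lpage>450</lpage>.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] <string-name><given-names>X.</given-names> <surname>Que</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Checconi</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Petrini</surname></string-name>, <string-name><given-names>J. A.</given-names> <surname>Gunnels</surname></string-name>, <article-title>Scalable community detection with the Louvain algorithm</article-title>, in: 2015 IEEE International Parallel and Distributed Processing Symposium, IEEE, <year>2015</year>, pp. <fpage>28</fpage>-<lpage>37</lpage>.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] <string-name><given-names>Z.</given-names> <surname>Peng</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Mo</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Ma</surname></string-name>, <article-title>Understanding user perceptions of robot’s delay, voice quality-speed trade-off and GUI during conversation</article-title>, in: Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, CHI EA ’20, Association for Computing Machinery, New York, NY, USA, <year>2020</year>, pp. <fpage>1</fpage>-<lpage>8</lpage>.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] <string-name><given-names>T.</given-names> <surname>Shiwa</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Kanda</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Imai</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Ishiguro</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Hagita</surname></string-name>, <article-title>How quickly should communication robots respond?</article-title>, in: HRI 2008, <year>2008</year>, pp. <fpage>153</fpage>-<lpage>160</lpage>.</mixed-citation>
      </ref>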
    </ref-list>
  </back>
</article>