                 Anticipation and its applications in human-machine interaction
                                                   Stanislav Ondáš, Matúš Pleva
                                   Department of Electronics and Multimedia Communications,
                        Faculty of Electrical Engineering and Informatics, Technical University of Košice,
                                          stanislav.ondas@tuke.sk, matus.pleva@tuke.sk

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Behind a person's ability to act as an interlocutor in a conversation lie many human capabilities, most of which are acquired unconsciously and very naturally in childhood. Human-like turn-taking in human-machine interaction (HMI) can be seen as a critical issue in achieving natural conversational interaction. The production of a listener's response starts before the speaker's turn is finished, which means that the listener is often able to anticipate the remaining content of the unfolding speaker turn. The ability to anticipate can therefore be identified as a very important human capability that supports rapid turn-taking. This anticipation process, which occurs during listening, is related to sentence comprehension and turn-taking. The paper studies this phenomenon on a small corpus of Slovak interviews, with attention focused on overlapping segments, and discusses possible applications in which anticipatory behavior on the machine side can bring benefits.

1 Introduction

   Nowadays, spoken communication between humans and machines has become commonplace, which is related to the new types of devices that are becoming part of our everyday life. Good examples are Amazon Echo, Google Home, TVs with voice control, voice search applications, and social robots.
   We can classify this form of communication mostly as simple dialogues, that is, question-answer scenarios and task-oriented, domain-specific dialogue.
   Unlike the mentioned scenarios, in the case of human-human spoken interaction the situation is significantly more complex. Here, information exchange is often performed very quickly; a lot of information is omitted but is supplemented by the listener based on common background and history (short-term, long-term, topic-related, speaker-related, situation-related). Behind a person's ability to act as an interlocutor in a conversation lie many important human capabilities, most of which are carried out unconsciously and very naturally. People are easily able to track the content, to detect the speech act (dialogue act) behind a speaker's turn, to perform effective and rapid turn-taking, to provide feedback in the role of the listener, and to construct their turns incrementally.
   In the proposed work, we focus on three related aspects of spoken communication: turn-taking, sentence comprehension, and anticipation.
   "A turn is the time when a speaker is talking, and turn-taking is the skill of knowing when to start and finish a turn in a conversation. It is an important organizational tool in spoken discourse." [1] Turn-taking can be described as a process in which one dialogue participant talks, then stops and gives the floor to another participant. It is a human skill that we learn without any effort in childhood.
   Stivers et al. [7] observed that human-human conversations are characterized by rapid turn-taking, often with a minimal gap of less than 200 ms. Several other studies (e.g. [8], [9]) indicate that utterance production can take more than 600 ms (see [10], [12]). This indicates that utterance production usually starts during listening. This finding is supported by measurements of EEG signals. Magyari et al. [2] observed changes in the EEG which relate to the fact that the human brain is able to estimate turn duration. This estimation is based on anticipating the way the turn will be completed. They found a neural correlate of turn-end anticipation, a beta-frequency desynchronization, as early as 1250 ms before the end of the turn. They suggest that anticipation of the speaker's utterance leads to accurately timed transitions in everyday conversations. The ability to anticipate the content of the speaker's turn and the turn end has been researched and confirmed in several papers (see e.g. [2], [3], [4], [6]).

   Figure 1. Turn-taking timing.

   Anticipation is related to sentence comprehension. It can be inferred that it is the meaning, rather than the word sequence, that is anticipated. We can assume that at the moment when a person is able to anticipate the remaining part of the speaker's turn, partial or complete comprehension has been reached, which can be signaled by one of the following behaviors: providing feedback (backchannel signals) or attempting to take the floor.
   The attempt to take the floor can result in overlapping speech. Overlapping speech is a segment of the conversation in which both interlocutors speak simultaneously: the speaker tries to finish his turn, but the listener starts his own turn. Several studies quantified a significant occurrence of overlaps in dyadic human-human spoken interactions (e.g. 18.9% in [26]).
   We suppose that overlaps are the best place to observe the anticipatory behavior of the listener.
   The reason why we decided to study the relation between rapid turn-taking, comprehension, and anticipation is the lack of fluency in human-machine spoken interaction. We can observe that machines are still not sufficiently skilled in turn-taking.
   Rapid turn-taking in HMI is especially important when we consider the fact that, in the case of humanoid, social, and family robots, people tend to expect more natural dialogue interaction, often called "conversation", due to their human-like embodiment.
   Human-like turn-taking in human-machine interactions can be seen as a critical issue in achieving natural conversational interaction in HMI [11], [12]. A good overview of this area can be found in [10].
   Together with the observation of anticipatory behavior in human-human interactions, several questions regarding human-machine spoken interaction arise, e.g.:
   Should machines leave their "predictable" turns unfinished when they observe comprehension on the side of the human listener? Can unfinished turns increase the natural character of human-machine interactions?
   Or: Would machines be able to interrupt a human speaker's turn, and if so, in which situations?
   How can anticipation be integrated on the machine side? How can machines catch the moment when the listener has enough information to comprehend?
   And finally: Could it be helpful if the machine were able to interrupt the human interlocutor?
   To answer the proposed questions, we decided to collect human-human interactions in order to analyze turn-taking, overlaps, and anticipatory behavior.
   The paper is organized as follows: Section 2 deals with anticipation in human-human spoken interaction, a description of the prepared corpus, and the results of its analysis. Section 3 provides a discussion of applications in which anticipation can bring benefits.

2 Anticipation in human-human spoken interactions

   Anticipation plays a very important role in human-human as well as human-machine spoken interaction, because it influences the speed of response in conversation (see [2]) and it is a critical ability enabling "rapid" turn-taking. Anticipation allows the listener to interpret partial utterances. Sagae et al. highlight the importance of anticipation when they conclude in [19]: "To achieve more flexible turn-taking with human users, for whom turn-taking and feedback at the sub-utterance level is natural, the system needs the ability to start interpretation of user utterances before they are completed." and "it also includes an utterance completion capability, where a virtual human can make a strategic decision to display its understanding of an unfinished user utterance by completing the utterances itself".
   Anticipation is routinely used by human interlocutors in dyadic interactions. Apart from the experimental results that indicate anticipatory behavior, in our opinion other indicators of an interlocutor's anticipation can also be identified:
   - Unfinished turns: We believe that, in the case of unfinished turns, the speaker assumes that the listener can anticipate the remaining part of his turn.
   - Overlaps and interruptions: We believe that the listener may try to take the floor when he has reached comprehension before the end of the speaker's turn.
   Interruptions differ from overlaps in their timing. The listener usually uses a small pause in the speaker's turn to interrupt the speaker and take the floor without causing an overlap. Such pauses are usually marked as "transition relevance places" (TRPs), as defined by Sacks et al. in [27]. They can also indicate the place where the anticipation core [20] or the "moment of the maximum understanding" [19] is located. We cannot claim that each TRP is a place where the listener is able to anticipate. It could be interesting to investigate whether the places where the anticipation core is located can be marked as TRPs.
   Unfinished turns and the occurrence of overlapping segments can have several other causes besides comprehension based on anticipation. Therefore, a careful analysis needs to be done.
   Mutual understanding or comprehension on the listener's side can be indicated by backchannel signals, which are usually produced by human listeners. Backchannel signals were defined by Yngve [14] as acoustic and visual signals provided during the speaker's turn. Allwood et al. and Poggi [15, 16] described the meaning of acoustic and visual feedback: such signals provide information about basic communicative functions, such as perception, attention, interest, understanding, attitude (e.g., belief, liking), and acceptance towards what the speaker is saying. Bevacqua et al. [17] defined associations between the listener's communicative functions and a set of backchannel signals. They performed an experiment with the 3D embodied agent Greta [18], which confirmed the defined associations. In the described experiment it was shown that there is an association between understanding and the following multimodal backchannel signals: raise eyebrows+"ooh", head nod+"ooh", head nod+"really", head nod+"yeah", and head nod.
2.1 Corpus

   A corpus of investigative interviews from the TV program "Na rovinu" and episodes of the TV discussion "Pod lampou" was selected for the analysis of turn-taking mechanisms.
   The overall length of the analyzed corpus was approx. 8 hours. There were 9 speakers (8 male, 1 female), two of whom played the role of the moderator (male).
   The processing of the recordings had two main levels. At the first level, the recordings were transcribed. The first transcriptions were generated by our automatic transcription system [25]. Then, the transcriptions were corrected manually. The transcriptions have the form of .trs files, which are generated by the Transcriber tool. At the second level, the turns and overlaps of both interlocutors were marked in the Anvil annotation tool, which enables turn-taking analysis.
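   The overlap statistics discussed in the next subsection can be derived directly from the two annotated turn tiers. The following Python sketch is a minimal illustration under the assumption that each speaker's turns are exported from the annotation as (start, end) time intervals; the function and variable names are ours and are not part of the Anvil tool.

    from typing import Dict, List, Tuple

    Interval = Tuple[float, float]          # one turn as (start_sec, end_sec)

    def find_overlaps(turns_a: List[Interval],
                      turns_b: List[Interval]) -> List[Interval]:
        """Return every time segment where a turn of A intersects a turn of B."""
        overlaps = []
        for a_start, a_end in turns_a:
            for b_start, b_end in turns_b:
                start, end = max(a_start, b_start), min(a_end, b_end)
                if start < end:             # non-empty intersection
                    overlaps.append((start, end))
        return overlaps

    def overlap_initiators(turns_mod: List[Interval],
                           turns_resp: List[Interval]) -> Dict[str, int]:
        """Count who jumped in: the speaker whose turn starts inside the other's."""
        counts = {"moderator": 0, "respondent": 0}
        for m_start, m_end in turns_mod:
            for r_start, r_end in turns_resp:
                if max(m_start, r_start) < min(m_end, r_end):   # turns overlap
                    if m_start > r_start:
                        counts["moderator"] += 1    # moderator started later
                    else:
                        counts["respondent"] += 1
        return counts

   Comparing the two initiator counts over the whole corpus corresponds to the moderator/respondent split of overlaps reported in Table 1.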
   In the first step of the analysis we focused mainly on overlapping segments. To classify the different types of overlaps, we designed 13 categories according to the intent behind the overlaps (see Fig. 2).
   Two basic categories – competitive (concurrent) and cooperative overlaps – are further divided according to their function. Competitive overlaps represent overlaps in which we can identify an attempt to grab the floor. On the other side, the cooperative (non-competitive) category marks overlaps in which the listener's goal is to assist the speaker in continuing his turn.

   Figure 2. Overlap categories.

   Figure 3. Overlap categories distribution.

2.2 Results

   Table 1 shows the turn-taking statistics obtained from the analyzed corpus.

   Table 1. Turn-taking statistics

                        1. Speaker (moderator)   2. Speaker (respondent)   Total
   Num. of turns        741 (45%)                905 (55%)                 1646
   Num. of overlaps     483 (84.3%)              90 (15.7%)                573 (34.8%)

   The total number of turns performed by both speakers was 1646. In 34.8% of cases, the turn was taken by the listener before the end of the speaker's turn, which causes overlapping speech.
   More than 84% of the overlaps were caused by the moderator, who jumped into the respondent's speech. The respondent jumped into the moderator's turns in approx. 16% of all cases. The explanation may be that the moderator is responsible for the interaction and needs to maintain the topic, topic shifts, and the timing (duration) of the interaction.
   The table in Fig. 3 shows the distribution of the intentions behind the overlaps, which we identified manually.
   According to the obtained results, there are significant differences between the moderator and the respondent. On the side of the moderator, the most numerous categories were: supporting the speaker (approx. 32%), requesting completing or additional information (around 20%), and asking an additional question (18%).
   On the side of the respondent, the most numerous categories were overlaps that occur when the listener expressed involvement and agreement (more than 57%).
   The obtained results show that the roles played by the interlocutors affect the number of overlaps as well as the intentions behind attempts to take the floor.
   A further issue is that the designed categories contain several items that can be identified as backchannel signals (express involvement/agreement/disagreement, express non-interest). As concluded above, backchannel signals indicate partial understanding rather than anticipatory behavior. However, information about these parts of the dialogue is also important for analyzing backchannel mechanisms in different scenarios and under the different roles the interlocutors play.
   The designed overlap categories correspond to the speech/dialogue acts conveyed through the utterances that interrupt the speaker's turn. Confirmation request, clarification request, agreement, and disagreement are typical dialogue acts (see e.g. [24]). On the other hand, categories like "Stop the speaker", "Support speaker", "Move to another topic", and "Try to take floor" can be seen as turn-management commands, which help to manage speaker changes.

3 Anticipation and its practical application

   The main reason why anticipation needs to be considered is to support rapid turn-taking, which enables smooth and fluent human-machine spoken interaction. But anticipation can also enable other human-like capabilities in HMI.
   Regarding the application of anticipation in human-machine dialogue interactions, only a few works deal with this phenomenon. We can mention the work of Dominey et al. [5], which focuses on next-turn anticipation based on the dialogue history, but in this case the anticipation is not aimed inside the turn. Sagae et al. focus their work on the interpretation of partial utterances, using word prediction – anticipation. They realize the idea of incremental user utterance processing, which makes it possible to increase the speed of turn-taking (switching the speaker).
   Anticipatory behavior on the machine side means the ability to find a sufficiently reliable hypothesis about the unfolding utterance just being produced by the human interlocutor. Machines that can communicate with the user through spoken language implement a human-machine communication chain. The modules of this chain have anticipatory potential, because they use resources with related data, such as the recognition network, language models, or the dialogue model.
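   One simple way to operationalize the question raised in the Introduction – how a machine can catch the moment when it has understood enough – is to monitor the stream of incremental ASR hypotheses and act only once the anticipated completion has become stable and confident. The following Python sketch only illustrates this idea; the predict_completion callback, the confidence field, and the thresholds are our assumptions, not components of an existing system.

    from dataclasses import dataclass
    from typing import Callable, List, Optional

    @dataclass
    class Partial:
        words: List[str]      # incremental ASR hypothesis recognized so far
        confidence: float     # confidence of this partial hypothesis, 0..1

    def ready_to_take_floor(history: List[Partial],
                            predict_completion: Callable[[List[str]], List[str]],
                            min_confidence: float = 0.7,
                            stable_steps: int = 3) -> Optional[List[str]]:
        """Return the anticipated completion once it is stable and confident,
        otherwise None (keep listening). Thresholds are illustrative only."""
        if len(history) < stable_steps:
            return None
        recent = history[-stable_steps:]
        completions = [predict_completion(p.words) for p in recent]
        stable = all(c == completions[-1] for c in completions) and bool(completions[-1])
        confident = all(p.confidence >= min_confidence for p in recent)
        return completions[-1] if (stable and confident) else None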

3.1 Applications

   While anticipatory behavior can be considered unimportant for simple task-oriented dialogue systems, it can give the interaction a more human-like character and bring advantages in systems for multi-party conversations, for human-machine collaboration scenarios, or for crisis scenarios, where it can be useful or even necessary for the machine to be able to abruptly interrupt the human speaker(s), to take the floor, and to propose ideas or solutions.

3.1.1 Machine in the role of a moderator

   A machine in the role of a discussion moderator is another example of an application in which anticipatory behavior can play an important role. Emotionally colored multi-party spoken interactions, such as political discussions, often require a lot of effort on the side of the moderator to lead the interaction and to manage turn-taking for all speakers. Moreover, the moderator often needs to enforce good behavior or compliance with the allotted time.

3.1.2 Machine in the task of simultaneous interpretation

   Anticipating in simultaneous interpreting simply means that interpreters say a word or a group of words before the speaker actually says them [13]. It means that interpreters who are familiar with the domain and content can interpret ahead thanks to anticipation. Anticipation is a key competence that interpreters need to learn before they can become professionals [13], [21].
   Nowadays, machine translation is a common application. There are also applications that perform interpretation, but they interpret whole sentences only after they have been pronounced. If we imagine a machine as a simultaneous interpreter, it must be able to anticipate in order to produce fluent simultaneous interpretations without meaningless gaps. In that case, anticipation can enhance the perceived quality and clarity of the translation.
   The anticipatory function can also help human interpreters by suggesting hypotheses about the next words in the speaker's utterance. The architecture of such a machine-supported simultaneous interpretation system is sketched in Fig. 4.

   Figure 4. Machine-supported simultaneous interpretation.

   One of the scenarios of machine-supported simultaneous interpretation is that the system listens to the speaker and tries to provide early predictions of the next words according to the just-recognized partial utterance. The speaker's speech is continually recognized by the Automatic Speech Recognition (ASR) module, which provides recognition hypotheses as often as possible. Each recognition hypothesis serves as the input to the hypothesis construction module, in which the next words are predicted according to general, speaker-dependent, and topic models. The statistical n-gram language models or DNN networks that are used in the ASR module can also serve for final hypothesis generation.
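   As a rough illustration of such a hypothesis construction module, the sketch below combines a general, a speaker-dependent, and a topic model by linear interpolation and greedily extends a partial ASR hypothesis by the most likely next words. The bigram form of the models and the interpolation weights are our assumptions for the sake of the example.

    from collections import Counter, defaultdict
    from typing import Dict, List

    Bigram = Dict[str, Counter]        # previous word -> counts of next words

    def train_bigram(sentences: List[List[str]]) -> Bigram:
        """Toy bigram model: count next-word continuations for every word."""
        model: Bigram = defaultdict(Counter)
        for words in sentences:
            for prev, nxt in zip(words, words[1:]):
                model[prev][nxt] += 1
        return model

    def next_word_scores(models: List[Bigram], weights: List[float],
                         prev: str) -> Counter:
        """Linearly interpolate the general, speaker and topic models."""
        scores: Counter = Counter()
        for model, weight in zip(models, weights):
            counts = model.get(prev, Counter())
            total = sum(counts.values()) or 1
            for word, count in counts.items():
                scores[word] += weight * count / total
        return scores

    def complete(partial: List[str], models: List[Bigram],
                 weights: List[float], max_words: int = 3) -> List[str]:
        """Greedily extend a partial ASR hypothesis by the most likely next words."""
        hypothesis = list(partial)
        while hypothesis and len(hypothesis) < len(partial) + max_words:
            scores = next_word_scores(models, weights, hypothesis[-1])
            if not scores:
                break
            hypothesis.append(scores.most_common(1)[0][0])
        return hypothesis

   In use, the three models would be trained on a general corpus, on the current speaker's past transcripts, and on in-domain text, respectively, and the interpolated completion would be shown to the interpreter as a suggestion.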
   The same scenario can also be used on the side of the interpreter, where the system can suggest the next words of his interpretation.
   Especially interesting could be an automatic transfer of emotions, which can help the interpreter choose the most appropriate words with respect to the emotional coloring of the interpreted speech. The work proposed in [23] can be used for the desired emotion transfer.

Conclusions

   The aim of this paper is to stimulate discussion on the use of anticipation and anticipatory behavior in HMI and in practical applications. Anticipation certainly lies behind smooth human-human interactions and rapid turn-taking, and we believe that it can significantly accelerate human-machine spoken interaction and support the human-like character of HMI.
   We believe that machine-supported anticipation can decrease the cognitive load in several applications, e.g. in simultaneous interpretation, where such a system can suggest hypotheses to the interpreter in advance or support the preparation of the interpretation result.
   We realize that the obtained results are relevant only to the interview scenario and that the distribution of the analyzed overlap categories will change according to the roles, relationships, and emotions of the interlocutors and according to the discussed topics. The first challenge of our future work will be collecting and analyzing dialogue interactions in other scenarios.

Acknowledgment

   The research presented in this paper was supported by the Slovak Research and Development Agency projects APVV SK-TW-2017-0005 and APVV-15-0731, by the Ministry of Education, Science, Research and Sport of the Slovak Republic under the research project VEGA 1/0511/17, and by the Cultural and Educational Grant Agency of the Slovak Republic, grant No. KEGA 009TUKE-4/2019.
References

[1] Turn-taking. https://www.teachingenglish.org.uk/article/turn-taking
[2] L. Magyari, M.C. Bastiaansen, J.P. de Ruiter, and S.C. Levinson, "Early Anticipation Lies behind the Speed of Response in Conversation", Journal of Cognitive Neuroscience, pp. 2530-2539, 2014.
[3] R.S. Gisladottir, S. Bögels, and S.C. Levinson, "Oscillatory Brain Responses Reflect Anticipation during Comprehension of Speech Acts in Spoken Dialog", Frontiers in Human Neuroscience, 2018.
[4] A.J. Liddicoat, "The projectability of turn constructional units and the role of prediction in listening", Discourse Studies, Vol. 6, No. 4, pp. 449-469, 2004.
[5] P.F. Dominey, G. Metta, F. Nori, and L. Natale, "Anticipation and initiative in human-humanoid interaction", Humanoids 2008 – 8th IEEE-RAS International Conference on Humanoid Robots, Daejeon, pp. 693-699, 2008.
[6] M. Paľová and E. Kiktová, "Prosodic anticipatory clues and reference activation in simultaneous interpretation", XLinguae, 12(1XL), pp. 13-22, 2019.
[7] T. Stivers, N.J. Enfield, P. Brown, C. Englert, M. Hayashi, T. Heinemann, et al., "Universals and cultural variation in turn-taking in conversation", Proceedings of the National Academy of Sciences, U.S.A., 106, pp. 10587-10592, 2009.
[8] P. Indefrey and W.J.M. Levelt, "The spatial and temporal signatures of word production components", Cognition, 92, pp. 101-144, 2004.
[9] T.T. Schnurr, A. Costa, and A. Caramazza, "Planning at the phonological level during sentence production", Journal of Psycholinguistic Research, 35, pp. 189-213, 2006.
[10] J. Holler, K.H. Kendrick, M. Casillas, and S.C. Levinson (eds.), Turn-Taking in Human Communicative Interaction, Lausanne: Frontiers Media, doi: 10.3389/978-2-88919-825-2, 2016.
[11] K.R. Thórisson, "Natural turn-taking needs no manual: computational theory and model, from perception to action", in Multimodality in Language and Speech Systems, eds. B. Granström, D. House, and I. Karlsson, Springer, Netherlands, pp. 173-207, 2002.
[12] M. Heldner and J. Edlund, "Pauses, gaps and overlaps in conversations", Journal of Phonetics, 38, pp. 555-568, 2010.
[13] Anticipation in simultaneous interpreting, https://www.languageconnections.com/, March 2018.
[14] V. Yngve, "On getting a word in edgewise", in Papers from the Sixth Regional Meeting of the Chicago Linguistic Society, pp. 567-577, 1970.
[15] J. Allwood, J. Nivre, and E. Ahlsén, "On the semantics and pragmatics of linguistic feedback", Journal of Semantics, Vol. 9, No. 1, 1993.
[16] I. Poggi, Mind, Hands, Face and Body: A Goal and Belief View of Multimodal Communication, Weidler, Berlin, 2007.
[17] E. Bevacqua, S. Pammi, S.J. Hyniewska, M. Schröder, and C. Pelachaud, "Multimodal Backchannels for Embodied Conversational Agents", in Intelligent Virtual Agents, IVA 2010, Lecture Notes in Computer Science, Vol. 6356, Springer, Berlin, Heidelberg, 2010.
[18] R. Niewiadomski, E. Bevacqua, M. Mancini, and C. Pelachaud, "Greta: an interactive expressive ECA system", in AAMAS 2009 – Autonomous Agents and MultiAgent Systems, Budapest, Hungary, 2009.
[19] K. Sagae, D. DeVault, and D.R. Traum, "Interpretation of partial utterances in virtual human dialogue systems", Proc. of NAACL HLT 2010, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 33-36, 2010.
[20] E. Kiktova and J. Zimmermann, "Detection of Anticipation Nucleus using HMM and Fuzzy based Approaches", (in review process) DISA 2018, August, Košice, Slovakia.
[21] K.G. Seeber, "Intonation and Anticipation in Simultaneous Interpreting", Cahiers de Linguistique Française, Vol. 23, pp. 61-97, ISSN 1661-3171, 2001.
[22] S. Ondáš, J. Juhár, M. Pleva, et al., "Speech technologies for advanced applications in service robotics", Acta Polytechnica Hungarica, Vol. 10, No. 5, pp. 45-61, ISSN 1785-8860, 2013.
[23] M. Mikula and K. Machová, "Combined approach for sentiment analysis in Slovak using a dictionary annotated by particle swarm optimization", Acta Electrotechnica et Informatica, Vol. 18, No. 2, pp. 27-34, DOI: 10.15546/aeei-2018-0013, 2018.
[24] S. Ondáš and J. Juhár, "Distance-based dialog acts labeling", 2015 6th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), Gyor, pp. 99-103, 2015.
[25] M. Lojka, P. Viszlay, J. Stas, D. Hladek, and J. Juhar, "Slovak Broadcast News Speech Recognition and Transcription System", in L. Barolli, N. Kryvinska, T. Enokido, and M. Takizawa (eds.), Advances in Network-Based Information Systems, NBiS 2018, Lecture Notes on Data Engineering and Communications Technologies (LNDECT), Vol. 22, Springer, Cham, pp. 385-394, 2019.
[26] I. Siegert, R. Bock, A. Wendemuth, B. Vlasenko, and K. Ohnemus, "Overlapping speech, utterance duration and affective content in HHI and HCI – A comparison", 2015 6th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), Gyor, pp. 83-88, 2015.
[27] H. Sacks, E.A. Schegloff, and G. Jefferson, "A simplest systematics for the organization of turn-taking for conversation", Language, Vol. 50, No. 4, pp. 696-735, Dec. 1974.