=Paper=
{{Paper
|id=Vol-2473/paper34
|storemode=property
|title=Anticipation and its Applications in Human-machine Interaction
|pdfUrl=https://ceur-ws.org/Vol-2473/paper34.pdf
|volume=Vol-2473
|authors=Stanislav Ondáš,Matúš Pleva
|dblpUrl=https://dblp.org/rec/conf/itat/OndasP19
}}
==Anticipation and its Applications in Human-machine Interaction==
Stanislav Ondáš, Matúš Pleva
Department of Electronics and Multimedia Communications, Faculty of Electrical Engineering and Informatics, Technical University of Košice
stanislav.ondas@tuke.sk, matus.pleva@tuke.sk

Abstract. Behind the capability of a person to be an interlocutor in a conversation lie many human capabilities, many of which are acquired unconsciously and very naturally in childhood. Human-like turn-taking in human-machine interaction (HMI) can be seen as a critical issue in achieving natural conversational interaction. The production of the listener's response starts before the speaker's turn is finished, which means that the listener is often able to anticipate the remaining content of the unfolding speaker turn. The ability to anticipate can be identified as a very important human capability, one which supports rapid turn-taking. This anticipation process, which occurs during listening, relates to sentence comprehension and turn-taking. The proposed paper studies this phenomenon on a small corpus of Slovak interviews, with attention focused on overlapping segments, and discusses possible applications where anticipatory behavior on the machine side can bring benefits.

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Nowadays, spoken communication between human and machine has become commonplace, which relates to the new types of devices that are becoming part of our everyday life. Good examples are Amazon Echo, Google Home, TVs with voice control, voice search applications, and social robots. We can classify this form of communication mostly as simple dialogues, meaning question-answer scenarios and task-oriented, domain-specific dialogues.

Unlike the mentioned scenarios, in the case of human-human spoken interaction the situation is significantly more complex. Here, information exchange is often performed very quickly; a lot of information is omitted but is supplemented by the listener based on common background and history (short-term, long-term, topic-related, speaker-related, situation-related). Behind the capability of a person to be an interlocutor in a conversation lie many important human capabilities, many of which are carried out unconsciously and very naturally. People are easily able to track the content, to detect the speech act (dialogue act) behind a speaker's turn, to perform effective and rapid turn-taking, to provide feedback in the role of listener, or to construct their turns incrementally.

In the proposed work, we focus on three related aspects of spoken communication: turn-taking, sentence comprehension, and anticipation.

"A turn is the time when a speaker is talking, and turn-taking is the skill of knowing when to start and finish a turn in a conversation. It is an important organizational tool in spoken discourse." [1] Turn-taking can be described as a process in which one dialogue participant talks, then stops and gives the floor to another participant. It is a human skill that we learn without any effort in childhood.

The reason why we decided to study the relation between rapid turn-taking, comprehension, and anticipation is the lack of fluency in human-machine spoken interaction. We can observe that machines are still not skilled enough at turn-taking. Rapid turn-taking in HMI is especially important when we consider that, in the case of humanoid, social, and family robots, people tend to expect more natural dialogue interaction, often called "conversation", due to the robots' human-like embodiment.

Stivers et al. [7] observed that human-human conversations are characterized by rapid turn-taking, often with a minimal gap of less than 200 msec. Several other studies (e.g. [8], [9]) indicate that utterance production can take more than 600 msec (see also [10], [12]). This indicates that utterance production usually starts during listening. This finding is supported by measurements of EEG signals. Magyari et al. [2] observed changes in EEG related to the fact that the human brain is able to estimate turn duration. This estimation is based on anticipating the way the turn will be completed. They found a neuronal correlate of turn-end anticipation, a beta-frequency desynchronization, as early as 1250 msec before the end of the turn. They suggest that anticipation of the speaker's utterance leads to accurately timed transitions in everyday conversations. The ability to anticipate the content of the speaker's turn and the turn end has been researched and confirmed in several papers (see e.g. [2], [3], [4], [6]).

Figure 1 Turn-taking timing

Anticipation relates to sentence comprehension. It can be inferred that, rather than the word sequence, it is the meaning that is anticipated. We can observe that at the moment when a person is able to anticipate the remaining part of the speaker's turn, there is partial or complete comprehension, which can be signaled by one of the following behaviors: providing feedback (backchannel signals) or attempting to take the floor.

The attempt to take the floor can result in overlapping speech. Overlapping speech is a segment of the conversation where both interlocutors speak simultaneously: the speaker tries to finish his turn, but the listener starts his own turn. Several studies quantified a significant occurrence of overlaps in dyadic human-human spoken interactions (e.g. 18.9% in [26]). We suppose that overlaps are the best place to observe the anticipatory behavior of the listener.

Human-like turn-taking in human-machine interaction can be seen as a critical issue in achieving natural conversational interaction in HMI [11], [12]. A great view inside this area can be found in [10].

Together with the observation of anticipatory behavior in human-human interactions, several questions regarding human-machine spoken interaction arise, e.g.:

Would machines leave their "predictable" turns unfinished when they observe comprehension on the side of the human listener? Can unfinished turns increase the natural character of human-machine interactions?

Or: Would machines be able to interrupt a human speaker's turn and, if yes, in which situations?

How can anticipation be integrated on the machine side? How can machines catch the moment when the listener has enough information to comprehend?

And finally: Could it be helpful if the machine were able to interrupt the human interlocutor?

To answer the proposed questions, we decided to collect human-human interactions in order to analyze turn-taking, overlaps, and anticipatory behavior.

The paper is organized as follows: the second section deals with anticipation in human-human spoken interaction, a description of the prepared corpus, and the results of its analysis. Section 3 provides a discussion of applications where anticipation can bring benefits.

2 Anticipation in human-human spoken interactions

Anticipation plays a very important role in human-human as well as human-machine spoken interaction, because it influences the speed of response in conversation (see [2]) and is a critical ability for enabling "rapid" turn-taking. Anticipation allows the listener to interpret partial utterances. Sagae et al. highlight the importance of anticipation when they conclude in [19]: "To achieve more flexible turn-taking with human users, for whom turn-taking and feedback at the sub-utterance level is natural, the system needs the ability to start interpretation of user utterances before they are completed," and "it also includes an utterance completion capability, where a virtual human can make a strategic decision to display its understanding of an unfinished user utterance by completing the utterances itself."

Anticipation is routinely used by human interlocutors in dyadic interactions. Apart from the exact results that indicate anticipatory behavior, in our opinion other indicators of the interlocutors' anticipation can also be identified:

Unfinished turns: We believe that in the case of unfinished turns, the speaker assumes that the listener can anticipate the remaining part of his turn.

Overlaps and interruptions: We believe that the listener may try to take the floor when he has reached comprehension before the end of the speaker's turn.

Interruptions differ from overlaps in their timing. The listener usually uses a small pause in the speaker's turn to interrupt the speaker and take the floor without causing an overlap. Such pauses are usually marked as "transition relevance places" (TRPs), as defined by Sacks et al. in [27]. They can also indicate the place where the anticipation core [20] or the "moment of the maximum understanding" [19] is located. We cannot claim that every TRP is a place where the listener is able to anticipate. It could be interesting to research whether the places where the anticipation core is located can be marked as TRPs.

Unfinished turns and the occurrence of overlapping segments can have several reasons other than comprehension based on anticipation. Therefore, careful analysis needs to be done.

Mutual understanding or comprehension on the listener's side can be indicated by backchannel signals, which are usually produced by human listeners. Backchannel signals were defined by Yngve [14] as acoustic and visual signals provided during the speaker's turn. Allwood et al. and Poggi [15], [16] described the meaning of acoustic and visual feedback: such signals provide information about basic communicative functions such as perception, attention, interest, understanding, attitude (e.g., belief, liking), and acceptance of what the speaker is saying. Bevacqua et al. [17] defined associations between the listener's communicative functions and a set of backchannel signals. They performed an experiment with the 3D embodied agent Greta [18], which confirmed the defined associations. In the described experiment it was shown that associations exist between understanding and the following multimodal backchannel signals: raised eyebrows + "ooh", head nod + "ooh", head nod + "really", head nod + "yeah", and head nod alone.

2.1 Corpus

A corpus of investigative interviews from the TV program "Na rovinu" and episodes of the TV discussion "Pod lampou" was selected for the analysis of turn-taking mechanisms. The overall length of the analyzed corpus was approx. 8 hours. There were 9 speakers (8 males, 1 female), two of whom played the role of moderator (male).

Recording processing had two main levels. At the first level, the recordings were transcribed. The first transcriptions were generated by our automatic transcription system [25]; the transcriptions were then corrected manually. Transcriptions have the form of .trs files, generated by the Transcriber tool. At the second level, the turns and overlaps of both interlocutors were marked in the Anvil annotation tool, which enables turn-taking analysis.

In the first step of the analysis, we focused mainly on overlapping segments. To classify different types of overlaps, we designed 13 categories according to the intent behind the overlaps (see Fig. 2). Two basic categories, concurrent and cooperative overlaps, are further divided according to their function. Concurrent (competitive) overlaps represent overlaps where we can identify an attempt to grab the floor. The cooperative (non-competitive) category, on the other hand, marks overlaps where the listener's goal is to assist the speaker in continuing his turn.

Figure 2 Overlaps categories

2.2 Results

Table 1 shows the turn-taking statistics obtained from the analyzed corpus.

Table 1. Turn-taking statistics

                    1. Speaker (moderator)   2. Speaker (respondent)   Total
Num. of turns       741 (45%)                905 (55%)                 1646
Num. of overlaps    483 (84.3%)              90 (15.7%)                573 (34.8%)

The total number of turns performed by both speakers was 1646. In 34.8% of cases, the turn was taken by the listener before the end of the speaker's turn, which caused overlapping speech. More than 84% of the overlaps were caused by the moderator, who jumped into the respondent's speech. The respondent jumped into the moderator's turns in approx. 16% of all cases. The explanation may be that the moderator is responsible for the interaction: he needs to maintain the topic, topic shifts, and the timing (duration) of the interaction.

The table in Fig. 3 shows the distribution of the intentions behind the overlaps, which we detected manually. According to the obtained results, we can identify significant differences between the moderator and the respondent. On the side of the moderator, the most numerous categories were: supporting the speaker (approx. 32%), requesting completing information or providing additional information (around 20%), and asking an additional question (18%). On the side of the respondent, the most numerous categories represent overlaps that occur when the listener expresses involvement and agreement (more than 57%). The obtained results show that the roles the interlocutors play affect the number of overlaps as well as the intentions behind attempts to take the floor.

Another issue is that the designed categories contain several items that can be identified as backchannel signals (expressing involvement/agreement/disagreement, expressing non-interest). As concluded above, backchannel signals indicate partial understanding rather than anticipatory behavior. However, information about these parts of the dialogue is important too, for analyzing backchannel mechanisms in different scenarios and in the different roles the interlocutors play.

The designed overlap categories correspond with the speech/dialogue acts that are conveyed through the utterances interrupting the speaker's turn. Confirmation request, clarification request, agreement, and disagreement are typical dialogue acts (see e.g. [24]). On the other hand, categories like "Stop the speaker", "Support speaker", "Move to another topic", and "Try to take floor" can be seen as turn-management commands, which help to manage speaker changes.

Figure 3 Overlap categories distribution

3 Anticipation and its practical application

The main reason why anticipation needs to be considered is to support rapid turn-taking, which enables smooth and fluent human-machine spoken interaction. But anticipation can also enable other human-like capabilities in HMI.

Regarding the application of anticipation in human-machine dialogue interaction, only a few works deal with this phenomenon. We can mention the work of Dominey et al. [5], which focuses on next-turn anticipation based on dialogue history; in this case, however, the anticipation is not directed inside the turn. Sagae et al. focus their work on the interpretation of partial utterances, using word prediction, i.e. anticipation. They realize the idea of incremental processing of the user's utterance, which makes it possible to increase the speed of turn-taking (speaker switching).

Anticipatory behavior on the machine side means the ability to find a sufficiently reliable hypothesis about the unfolding utterance just being uttered by the human interlocutor. Machines that can communicate with a user through spoken language implement a human-machine communication chain. The modules of this chain have anticipatory potential, because they use resources with related data, such as the recognition network, language models, or the dialogue model.
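The anticipatory potential of a language model can be illustrated with a minimal sketch: a toy bigram model that proposes likely next words for a partial (incremental) recognition hypothesis. The function names, the miniature training data, and the simple maximum-count ranking are illustrative assumptions, not the implementation behind Fig. 4; a real system would draw on the ASR module's own n-gram/DNN language models and on speaker- and topic-dependent models.

```python
from collections import defaultdict

def train_bigrams(sentences):
    """Count word-to-next-word transitions from example utterances."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def anticipate_next(counts, partial_utterance, top_n=3):
    """Rank candidate continuations of the last word heard so far."""
    words = partial_utterance.lower().split()
    if not words or words[-1] not in counts:
        return []
    followers = counts[words[-1]]
    ranked = sorted(followers.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:top_n]]

# Toy training data standing in for general/speaker/topic models.
corpus = [
    "what do you think about the result",
    "what do you mean",
    "do you think it will work",
]
model = train_bigrams(corpus)
print(anticipate_next(model, "tell me what do you"))  # → ['think', 'mean']
```

Each new partial hypothesis from the recognizer would simply be fed to `anticipate_next` again, so predictions are refreshed as often as the ASR module emits output.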
3.1 Applications

While anticipatory behavior can be considered unimportant for simple task-oriented dialogue systems, it can bring a more human-like character to the interaction, as well as advantages in systems for multi-party conversations, in human-machine collaboration scenarios, or in crisis scenarios, where it can be useful or necessary for the machine to be able to interrupt a human speaker abruptly, to take the floor, and to propose ideas or solutions.

3.1.1 Machine in the role of a moderator

A machine in the role of a discussion moderator is one example of an application where anticipatory behavior can play an important role. Emotionally colored multi-party spoken interactions, such as political discussions, often require a lot of effort on the side of the moderator to lead the interaction and to manage turn-taking for all speakers. Moreover, the moderator often needs to enforce good behavior or compliance with the specified time.

3.1.2 Machine in the task of simultaneous interpretation

Anticipation in simultaneous interpreting simply means that interpreters say a word or a group of words before the speaker actually says them [13]. This means that interpreters familiar with the domain and content can interpret ahead of the speaker thanks to anticipation. Anticipation is a key competence that interpreters need to learn before they can become professionals [13], [21].

Nowadays, machine translation is a common application. There are also applications that perform interpretation, but they interpret whole sentences only after they have been pronounced. If we imagine the machine as a simultaneous interpreter, it must be able to anticipate in order to produce fluent simultaneous interpretations without meaningless gaps. In that case, anticipation can enhance the perceived quality and clarity of the translation.

The anticipatory function can also help human interpreters by suggesting hypotheses about the next words in the speaker's utterance. The architecture of such a machine-supported simultaneous interpretation system is sketched in Fig. 4.

Figure 4 Machine-supported simultaneous interpretation

One scenario of machine-supported simultaneous interpretation is that the system listens to the speaker and tries to provide early predictions of the next words according to the partial utterance recognized so far. The speaker's speech is continually recognized by the Automatic Speech Recognition (ASR) module, which provides recognition hypotheses as often as possible. Each recognition hypothesis serves as the input to the hypothesis construction module, in which the next words are predicted according to general, speaker-dependent, and topic models. The statistical n-gram language models or DNNs used in the ASR module can also serve for final hypothesis generation.

The same scenario can also be used on the side of the interpreter, where the system can suggest the next words of his interpretation. Especially interesting can be the automatic transfer of emotions, which can help the interpreter to choose the most appropriate words with respect to the emotional coloring of the interpreted speech. The approach proposed in [23] can be used for the desired emotion transfer.

Conclusions

The aim of this paper is to stimulate discussion on the use of anticipation and anticipatory behavior in HMI and in practical applications. Anticipation certainly lies behind smooth human-human interactions and rapid turn-taking, and we believe that it can significantly accelerate human-machine spoken interaction and support the human-like character of HMI.

We believe that machine-supported anticipation can decrease cognitive load in several applications, e.g. in simultaneous interpretation, where such a system can suggest hypotheses to the interpreter in advance or support the preparation of the interpretation result.

We realize that the obtained results are relevant only to the interview scenario and that the distribution of the analyzed overlap categories will change according to the roles, relationships, and emotions of the interlocutors and according to the discussed topics. The first challenge of our future work will be collecting and analyzing dialogue interactions in other scenarios.

Acknowledgment

The research presented in this paper was supported by the Slovak Research and Development Agency projects APVV SK-TW-2017-0005 and APVV-15-0731, by the Ministry of Education, Science, Research and Sport of the Slovak Republic under research project VEGA 1/0511/17, and by the Cultural and Educational Grant Agency of the Slovak Republic, grant No. KEGA 009TUKE-4/2019.

References

[1] https://www.teachingenglish.org.uk/article/turn-taking
[2] L. Magyari, M.C. Bastiaansen, J.P. de Ruiter, and S.C. Levinson, "Early Anticipation Lies behind the Speed of Response in Conversation", Journal of Cognitive Neuroscience, pp. 2530-2539, 2014.
[3] R.S. Gisladottir, S. Bögels, and S.C. Levinson, "Oscillatory Brain Responses Reflect Anticipation during Comprehension of Speech Acts in Spoken Dialog", Front. Hum. Neurosci., 2018.
[4] A.J. Liddicoat, "The projectability of turn constructional units and the role of prediction in listening", Discourse Studies, Vol. 6, No. 4, pp. 449-469, 2004.
[5] P.F. Dominey, G. Metta, F. Nori, and L. Natale, "Anticipation and initiative in human-humanoid interaction", Humanoids 2008 - 8th IEEE-RAS International Conference on Humanoid Robots, Daejeon, pp. 693-699, 2008.
[6] M. Paľová, E. Kiktová, "Prosodic anticipatory clues and reference activation in simultaneous interpretation", XLinguae 12(1XL), pp. 13-22, 2019.
[7] T. Stivers, N.J. Enfield, P. Brown, C. Englert, M. Hayashi, T. Heinemann, et al., "Universals and cultural variation in turn-taking in conversation", Proceedings of the National Academy of Sciences, U.S.A., 106, 10587-10592, 2009.
[8] P. Indefrey, W.J.M. Levelt, "The spatial and temporal signatures of word production components", Cognition, 92, 101-144, 2004.
[9] T.T. Schnur, A. Costa, A. Caramazza, "Planning at the phonological level during sentence production", Journal of Psycholinguistic Research, 35, 189-213, 2006.
[10] J. Holler, K.H. Kendrick, M. Casillas, S.C. Levinson, eds., Turn-Taking in Human Communicative Interaction, Lausanne: Frontiers Media, doi: 10.3389/978-2-88919-825-2, 2016.
[11] K.R. Thórisson, "Natural turn-taking needs no manual: computational theory and model, from perception to action", in Multimodality in Language and Speech Systems, eds. B. Granström, D. House, and I. Karlsson (Netherlands: Springer), 173-207, 2002.
[12] M. Heldner and J. Edlund, "Pauses, gaps and overlaps in conversations", J. Phon., 38, 555-568, 2010.
[13] Anticipation in simultaneous interpreting, https://www.languageconnections.com/, March 2018.
[14] V. Yngve, "On getting a word in edgewise", in Papers from the Sixth Regional Meeting of the Chicago Linguistic Society, pp. 567-577, 1970.
[15] J. Allwood, J. Nivre, and E. Ahlsén, "On the semantics and pragmatics of linguistic feedback", Journal of Semantics, Vol. 9, No. 1, 1993.
[16] I. Poggi, Mind, Hands, Face and Body: A Goal and Belief View of Multimodal Communication, Weidler, Berlin, 2007.
[17] E. Bevacqua, S. Pammi, S.J. Hyniewska, M. Schröder, and C. Pelachaud, "Multimodal Backchannels for Embodied Conversational Agents", in Intelligent Virtual Agents, IVA 2010, Lecture Notes in Computer Science, Vol. 6356, Springer, Berlin, Heidelberg, 2010.
[18] R. Niewiadomski, E. Bevacqua, M. Mancini, and C. Pelachaud, "Greta: an interactive expressive ECA system", in AAMAS 2009 - Autonomous Agents and MultiAgent Systems, Budapest, Hungary, 2009.
[19] K. Sagae, D. DeVault, and D.R. Traum, "Interpretation of partial utterances in virtual human dialogue systems", Proc. of NAACL HLT 2010, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 33-36, 2010.
[20] E. Kiktová, J. Zimmermann, "Detection of Anticipation Nucleus using HMM and Fuzzy based Approaches", (in review process) DISA 2018, August, Košice, Slovakia.
[21] K.G. Seeber, "Intonation and Anticipation in Simultaneous Interpreting", Cahiers de Linguistique Française, Vol. 23, pp. 61-97, ISSN 1661-3171, 2001.
[22] S. Ondáš, J. Juhár, M. Pleva, et al., "Speech technologies for advanced applications in service robotics", Acta Polytechnica Hungarica, Vol. 10, No. 5, pp. 45-61, 2013, ISSN 1785-8860.
[23] M. Mikula and K. Machová, "Combined approach for sentiment analysis in Slovak using a dictionary annotated by particle swarm optimization", Acta Electrotechnica et Informatica, Vol. 18, No. 2, pp. 27-34, 2018, DOI: 10.15546/aeei-2018-0013.
[24] S. Ondáš and J. Juhár, "Distance-based dialog acts labeling", 2015 6th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), Győr, pp. 99-103, 2015.
[25] M. Lojka, P. Viszlay, J. Staš, D. Hládek, J. Juhár, "Slovak Broadcast News Speech Recognition and Transcription System", in L. Barolli, N. Kryvinska, T. Enokido, M. Takizawa (eds.), Advances in Network-Based Information Systems, NBiS 2018, Lecture Notes on Data Engineering and Communications Technologies (LNDECT), Vol. 22, Springer, Cham, pp. 385-394, 2019.
[26] I. Siegert, R. Böck, A. Wendemuth, B. Vlasenko and K. Ohnemus, "Overlapping speech, utterance duration and affective content in HHI and HCI - a comparison", 2015 6th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), Győr, pp. 83-88, 2015.
[27] H. Sacks, E.A. Schegloff, and G. Jefferson, "A simplest systematics for the organization of turn-taking for conversation", Language, Vol. 50, No. 4, pp. 696-735, Dec. 1974.