New applications of gaze tracking in speech science

Mattias Bystedt[0000-1111-2222-3333] and Jens Edlund[0000-0001-9327-9482]

KTH Royal Institute of Technology, Speech, Music & Hearing, Stockholm
mbystedt@kth.se, edlund@speech.kth.se

Abstract. We present an overview of speech research applications of gaze tracking technology, where gaze behaviours are exploited as a tool for analysis rather than as a primary object of study. The methods presented are all in their infancy, but can greatly assist the analysis of digital audio and video as well as unlock the relationship between writing and other encodings on the one hand, and natural language, such as speech, on the other. We discuss three directions in this type of gaze tracking application: modelling of text that is read aloud, evaluation and annotation with naïve informants, and evaluation and annotation with expert annotators. In each of these areas, we use gaze tracking information to gauge the behaviour of people when working with speech and conversation, rather than when reading text aloud or partaking in conversations, in order to learn something about how the speech may be analysed from a human perspective.

Keywords: gaze tracking, speech technology, label acquisition, annotation

1 Introduction

Gaze tracking is used in a number of applications that aim to improve interaction between humans or between humans and machines. Examples include interfaces that utilize gaze tracking to allow hands-free pointing, to track the addressee of speech, or that utilize pupil dilation to track cognitive load, for example in drivers. There is a wide range of such application areas. Gaze has been associated with, for example, cognitive state, cognitive load, direction of visual attention and turn taking in conversation (Eckstein, Guerra-Carrillo, Miller Singley, & Bunge 2017; Rayner 1998). More recently, a number of new applications of gaze tracking have surfaced in which gaze tracking is used to model the behaviour of people who are somehow working with or analysing actions (e.g. speech), rather than people who are in the process of performing them (e.g. partaking in a conversation).

In the following, we describe three main areas where gaze tracking is exploited in this relatively new manner, as a tool to assist in the analysis of speech and language, rather than as an object of research in itself:

- Modelling of text to be read aloud
- Evaluation and annotation with naïve informants
- Evaluation and annotation with expert annotators

We discuss the potential benefits and the risks involved, and highlight the particular requirements placed on gaze tracking technology by these specific application areas, as compared to other areas where gaze is used, such as in experimental psychology and in interaction design.

2 Modelling of text to be read aloud

Reading texts aloud is an important application area of speech technology. Apart from its use in commercial applications, speech synthesis, or text-to-speech (TTS), is used by government authorities to produce talking books as a means of making texts accessible to those who are for some reason not able to read in the traditional sense of the word. As an example, the Swedish Agency for Accessible Media has produced thousands of talking university text books and continuously produces around a hundred talking newspapers using TTS.
The impact of less-than-perfect TTS, then, is great, but we still struggle to find viable ways of modelling how text should best be read aloud.

A key characteristic of speech is that it exhibits variation in prominence. Speakers make some words more prominent by lengthening and by expansion of spectral characteristics, whereas other words are more backgrounded, that is, reduced. A number of models have been suggested to help assess which words in a text that is to be read aloud should be made prominent, including measures based on the probabilistic aspects of the text (Malisz, Brandt, Möbius, Oh, & Andreeva 2018). High contextual probability of a word correlates with lower prominence and vice versa (Aylett & Turk 2004); assessing this probability might therefore help identify words to be made prominent.

Behavioural research has long used gaze data to study reading, which has illuminated salient linguistic cues used by skilled readers. Word frequency, neighbourhood frequency, and syllable length have all been associated with our reading behaviour and with our processing of textual information. It is therefore likely that gaze data can refine, complement or substitute statistical models of text. A key difference between gaze behaviour and language models is that the latter, while efficient in capturing predictability from a within-text point of view, do not capture the processing steps as they occur online while a person is reading. Gaze data has also been shown to correlate with subjective measures of word predictability in sentences (Bystedt 2016; Schwanenflugel & LaCount 1988).

In ongoing studies, we are attempting to model prominence via gaze, by collating gaze tracking data from multiple readers of the same text to determine to which words readers pay particular attention. We hypothesize that quantified gaze data can accompany and strengthen models based on the sample distribution of contextual predictability of words in the text – statistics which frequently form the basis for prosodic models. Others have also pointed out that future prosodic modelling with gaze or other similarly inspired applications could emerge as alternative methods for modelling prosody in TTS (Vainio, Suni, & Aalto 2015).
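As a minimal sketch of the collation step we have in mind – assuming a toy input format, invented fixation durations and a hypothetical per-word predictability score, none of which come from an actual gaze tracker or language model – the following Python snippet averages each reader's total dwell time on every word of a shared text and lines the result up with that word's contextual predictability:

```python
from collections import defaultdict
from statistics import mean

# Each fixation record: (reader_id, word_index, fixation_duration_ms), where
# word_index points into the token sequence of the shared stimulus text.
# The records below are toy values standing in for exported gaze data.
fixations = [
    ("r1", 0, 180), ("r1", 2, 310), ("r1", 2, 220), ("r1", 4, 260),
    ("r2", 0, 150), ("r2", 2, 290), ("r2", 3, 120), ("r2", 4, 300),
    ("r3", 2, 340), ("r3", 4, 280),
]

# Hypothetical per-word contextual predictability (e.g. from a language model);
# low predictability is expected to go with high prominence and long dwell.
predictability = {0: 0.40, 1: 0.85, 2: 0.05, 3: 0.60, 4: 0.08}


def mean_dwell_per_word(fixations):
    """Average, over readers, of each reader's total dwell time on a word."""
    per_reader = defaultdict(lambda: defaultdict(float))
    for reader, word, duration in fixations:
        per_reader[reader][word] += duration
    words = {word for _, word, _ in fixations}
    return {w: mean(per_reader[r].get(w, 0.0) for r in per_reader) for w in words}


for word, dwell in sorted(mean_dwell_per_word(fixations).items()):
    print(f"word {word}: mean dwell {dwell:6.1f} ms, "
          f"predictability {predictability[word]:.2f}")
```

A real study would additionally normalize for word length and reading order, and would probably prefer first-pass fixation measures over raw dwell time; the point here is only the collation of gaze data across readers against a text-based measure.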
3 Evaluation and annotation with naïve informants

Speech science and speech technology research are increasingly data driven, and today, the greatest bottleneck for machine learning of speech and human interaction behaviours is no longer limited access to speech data, of which there is a lot, but rather access to useful annotations of speech data. Acquiring good annotations, or labels, on which to train models is expensive and time consuming, and developing efficient methods for these purposes is becoming increasingly important.

In addition, speech technology research is in most cases an iterative process, where a method is developed and trained, then evaluated, and then retrained using more or better data. The results of many evaluations can be viewed as a series of examples of what went well and what did not work. Seen from another perspective, this is training data, and it is often the case that the results of an evaluation that ends an iteration become the input for the training of the next iteration. Evaluation and annotation, then, are closely related. Gaze tracking has been used both for evaluation and annotation in speech science.

Here, we discuss two methods that operate on naïve informants in the sense that the informants are not professionals, nor have they received any special training to complete their tasks. Instead, these methods tap into more or less subconscious human communicative behaviours. The first method evaluates TTS quality by measuring gaze patterns of informants responding to instructions during an audio-visual task, and the second tracks the gaze target of people observing recorded interactions to learn more about speaker changes in conversation.

Most TTS evaluation methods fall short when it comes to pinpointing where in the evaluated speech a problem occurs. When informants fill in a questionnaire after listening to TTS, they are not able to point to particular instances of the speech to which they listened, but rather make general judgements. Gaze data, on the other hand, is synchronous, meaning that it may give insight into a person's perception at the same moment it takes place.

Swift, Campana, Allen and Tanenhaus (2002) first employed gaze data as an objective measure for TTS evaluation. Using a temporal resolution of 60 Hz, they could capture incremental recognition of single words, i.e. phoneme-by-phoneme processing. The experiment was quite specific: informants responded to an audio instruction to fixate one of many items situated in front of them while gaze movement was recorded. Subsequent researchers manipulated prosody in the audio instructions and looked for facilitation, that is, cases where the informant completes the task faster than normal, and also compared TTS to a gold standard of human speech. Van Hooijdonk, Commandeur, Cozijn, Krahmer and Marsi (2007) used the same experimental paradigm to determine that two consecutive instructions to select the same object facilitated object localization when prosodic marking was present. They also found an interaction between object and speech condition in TTS as opposed to human speech, indicating that anticipation, and not prosody, aided audio recognition of TTS with low intelligibility.

More recently, White, Rajkumar, Ito and Speer (2014) tested prosody on two levels. A target word and its adjective were accented in one of two ways: Condition 1: adj + noun = L*H* + no-acc; Condition 2: adj + noun = H* + H*. Words with L*H* accent were hypothesized to be acoustically salient and therefore most likely to attract attention. Gaze tracking showed that the L*H* marking facilitated object localization when it occurred in human speech, but not in TTS. After acoustic and various metric analyses, they concluded that a combination of data from offline subjective measures (e.g. ratings) and online objective measures (e.g. gaze) can reveal differences between how people perceive and process synthetic as compared to human speech.

When it comes to speaker changes, a recurring problem with data from real conversations is that one cannot be sure that a place where a speaker change occurred was actually a suitable place just because it occurred, as people sometimes flout conventions, and similarly one cannot be certain that a place where no speaker change occurred was an inappropriate place for a speaker change.

Based on the observations that lookers-on of a conversation fixate the speaking person and redirect their gaze in expectation of a speaker change (Edlund et al. 2012), gaze tracking of 3rd party observers of conversations has been used to provide insights into speaker changes that might have occurred but did not, and those that occurred where they may not have been expected. Tice and Henetz (2011) introduced the method, and others have since used it in different experimental settings with similar results (Edlund et al. 2012). In short, the paradigm consists of analysing data from informants viewing a recorded dyadic conversation, which is presented split-screen with one conversant on each side of the screen. The visual attention of the observer is used to assess the predictability of speaker changes that occur, and to point at times where a speaker change could well have occurred but did not. Compared to standard means of annotation, this measure is continuous and depends on the real-time perception of a human observer on very small time frames, which creates a signal with strong potential as a machine learning feature.
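The continuous nature of this measure is easy to illustrate. The sketch below (Python; the frame-based label format, the frame rate and the gaze labels are our own assumptions rather than the export of any particular tracker or of the studies cited above) turns the per-frame gaze targets of several observers of a split-screen recording into a single time series: the share of observers looking at the left-hand conversant at each moment. Swings in such a curve can then be aligned with actual speaker changes, or fed to a machine learning model as a feature:

```python
# One gaze label per video frame and observer: "L" (left conversant),
# "R" (right conversant) or "X" (elsewhere / tracking lost). The labels,
# the number of observers and the frame rate are assumptions for the example.
observers = {
    "obs1": "LLLLLLRRRRRRRRLLLL",
    "obs2": "LLLLLRRRRRRRRRLLLL",
    "obs3": "LLLLLLLRRRRRRRLLXR",
}

FPS = 25  # assumed frame rate of the split-screen recording


def left_gaze_proportion(observers):
    """Per frame, the fraction of observers fixating the left conversant."""
    n_frames = min(len(sequence) for sequence in observers.values())
    series = []
    for frame in range(n_frames):
        labels = [sequence[frame] for sequence in observers.values()]
        on_screen = [label for label in labels if label in ("L", "R")]
        series.append(sum(label == "L" for label in on_screen) / len(on_screen)
                      if on_screen else float("nan"))
    return series


for frame, proportion in enumerate(left_gaze_proportion(observers)):
    print(f"{frame / FPS:5.2f} s  share of gaze on left conversant: {proportion:.2f}")
```

Where such a curve swings without a corresponding speaker change in the recording, a change may well have been possible but did not occur, which is exactly the kind of information that is hard to obtain from the conversation itself.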
4 Annotation with expert annotators

Trained professionals are paid to annotate data. The standard way of getting transcriptions of speech, for example, is to pay transcribers to write down what is said, usually painstakingly following a detailed transcription manual. Other tasks are performed similarly. Phonetic segmentation, for example, is the task of splitting utterances into their phonetic units. Phonetically segmented speech is used for a wide range of purposes in speech technology development, including the training of ASR and TTS. Automatic segmentation – or forced alignment – does this job very well for some recordings, but in other cases manual labour is still required to reach acceptable quality.

Manual phonetic segmentation is an example where data sets on expert annotator gaze behaviour can add new layers of information to training data for developing automatic methods. Khan, Steiner, Sugano, Bulling and Macdonald (2018) captured gaze data and other behaviours from annotators required to draw exact time boundaries between segments in spectrograms. Preliminary results on their data show improvements on automatic segmentation using the behaviour data. At KTH Speech, Music and Hearing, researchers are investigating a similar method in which the gaze behaviours of a so-called Wizard-of-Oz – a person controlling a spoken dialogue system behind the scenes for data collection purposes – are collected to be used as a feature for training.
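As an illustration only – we do not know the exact feature sets used by Khan et al. (2018) or in the ongoing KTH work, and all names and numbers below are invented – the following sketch computes, for each candidate boundary proposed by a forced aligner, how much expert gaze time fell within a small window around that point on the spectrogram's time axis. A feature of this kind could accompany acoustic evidence when training or correcting an automatic segmenter:

```python
# Candidate phone boundaries (in seconds) proposed by a forced aligner, and
# expert fixations on the spectrogram as (start, end) intervals on the same
# time axis. All values are invented for the sake of the example.
candidate_boundaries = [0.12, 0.31, 0.48, 0.80]
expert_fixations = [(0.05, 0.15), (0.28, 0.36), (0.30, 0.40), (0.74, 0.78)]

WINDOW = 0.05  # seconds of context considered on each side of a boundary


def overlap(a_start, a_end, b_start, b_end):
    """Length (in seconds) of the overlap between two intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def gaze_support(boundary, fixations, window=WINDOW):
    """Total expert dwell time within +/- window around a candidate boundary."""
    lo, hi = boundary - window, boundary + window
    return sum(overlap(lo, hi, start, end) for start, end in fixations)


for boundary in candidate_boundaries:
    support = gaze_support(boundary, expert_fixations)
    print(f"boundary at {boundary:.2f} s: expert gaze support {support:.3f} s")
```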
5 Requirements as compared to other application areas

In most fields where gaze tracking has been employed, there is a need to pay great attention to aspects such as control, exactness (which requires precise calibration), and clear instructions. Often, the goal of the tracking makes it necessary to know, for each moment in the experiment, exactly where the gaze rests. Conversely, the applications we discuss here generally require large amounts of statistical data, where the ill effects of a small proportion of errors may be less damaging. On the other hand, ease of use is important (or the professionals will not agree to take part), and calibration must be unobtrusive or even hidden (lest the presence of the instrument endanger the ecological validity of the experiment). In general, the requirements of these applications are more concerned with usability issues and less with experimental control.

6 Conclusions and next steps

In summary, we believe that tracked gaze behaviours can boost a wide range of speech technology and spoken interaction research methods. However, gaze applications of this nature place different requirements on the gaze tracking technology, and the methods are still in their infancy.

Our work for the near future focusses on modelling of text to be read aloud with TTS and on further exploring the possibilities afforded by registering 3rd party observers' gaze behaviours. We predict, however, that as gaze tracking hardware becomes increasingly accessible, the range of applications of the type discussed here will quickly grow.

References

[1] M. K. Eckstein, B. Guerra-Carrillo, A. T. Miller Singley, and S. A. Bunge, “Beyond eye gaze: What else can eyetracking reveal about cognition and cognitive development?,” Dev. Cogn. Neurosci., vol. 25, pp. 69–91, Jun. 2017.
[2] K. Rayner, “Eye Movements in Reading and Information Processing: 20 Years of Research,” Psychol. Bull., 1998.
[3] Z. Malisz, E. Brandt, B. Möbius, Y. M. Oh, and B. Andreeva, “Dimensions of Segmental Variability: Interaction of Prosody and Surprisal in Six Languages,” Front. Commun., vol. 3, p. 25, Jul. 2018.
[4] M. Aylett and A. Turk, “The Smooth Signal Redundancy Hypothesis: A Functional Explanation for Relationships between Redundancy, Prosodic Prominence and Duration in Spontaneous Speech,” Lang. Speech, vol. 47, no. 1, pp. 31–56, 2004.
[5] P. J. Schwanenflugel and K. L. LaCount, “Semantic Relatedness and the Scope of Facilitation for Upcoming Words in Sentences,” J. Exp. Psychol. Learn. Mem. Cogn., 1988.
[6] M. Kurnik, “Bilingual lexical access in reading,” Cent. Res. Biling., Dep. Swedish Lang. Multiling., 2016.
[7] M. Vainio, A. Suni, and D. Aalto, “Emphasis, Word Prominence, and Continuous Wavelet Transform in the Control of HMM-Based Synthesis,” Springer, Berlin, Heidelberg, 2015, pp. 173–188.
[8] M. D. Swift, E. Campana, J. F. Allen, and M. K. Tanenhaus, “Monitoring eye movements as an evaluation of synthesized speech,” in Proc. 2002 IEEE Workshop on Speech Synthesis, pp. 19–22, 2002.
[9] C. Van Hooijdonk, E. Commandeur, R. Cozijn, E. Krahmer, and E. Marsi, “Using eye movements for online evaluation of speech synthesis,” in Proc. INTERSPEECH 2007, pp. 217–220, 2007.
[10] M. White, R. Rajkumar, K. Ito, and S. R. Speer, “Eye tracking for the online evaluation of prosody in speech synthesis,” in Natural Language Generation in Interactive Systems, A. Stent and S. Bangalore, Eds. Cambridge: Cambridge University Press, 2014, pp. 281–301.
[11] J. Edlund et al., “3rd party observer gaze as a continuous measure of dialogue flow,” in Proc. LREC 2012, 2012.
[12] M. Tice and T. Henetz, “The eye gaze of 3rd party observers reflects turn-end boundary projection,” 2011.
[13] A. Khan, I. Steiner, Y. Sugano, A. Bulling, and R. Macdonald, “A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks,” in Proc. LREC 2018, 2018.