Measuring the Effect of ITS Feedback Messages on Students' Emotions

Han Jiang

Zewelanji Serpell

Jacob Whitehill

Virginia Commonwealth University, Richmond, VA 23284, USA
Worcester Polytechnic Institute, Worcester, MA 01605, USA

When an ITS gives supportive, empathetic, or motivational feedback messages to the learner, does it alter the learner's emotional state, and can the ITS detect the change? We investigated this question on a dataset of n = 36 African-American undergraduate students who interacted with iPad-based cognitive skills training software that issued various feedback messages. Using both automatic facial expression recognition and heart rate sensors, we estimated the effect of the different messages on short-term changes to students' emotions. Our results indicate that, except for a few specific messages ("Great Job" and "Good Job"), the evidence for the existence of such effects was meager, and the effect sizes were small. Moreover, for the "Good Job" and "Great Job" messages, the effects can easily be explained by the student having recently scored a point, rather than by the feedback itself. This suggests that the emotional impact of such feedback, at least in the particular context of our study, is either very small or undetectable by heart rate or facial expression sensors.

Keywords: intelligent tutoring systems, emotion, feedback, facial expression analysis, heart rate analysis

One of the main goals of contemporary research in intelligent tutoring systems (ITS) is to promote student learning by both sensing the student's emotions and responding with affect-sensitive feedback that is appropriate to the student's cognitive and affective state. For sensing students' emotions, a variety of methods are now available, including physiological measurements [18], facial expression analysis [23], and "sensor-free" approaches [15] based on analyzing the ITS logs. Given an estimate of what the student knows and how they feel, the tutor must then decide how to respond. Based on the intuition that good human tutors are often empathetic and supportive, many ITS today provide real-time "empathic feedback" to learners that tries to encourage and motivate them to keep learning. This feedback can range in complexity from short utterances [1, 9, 16, 6] to longer prompts [2, 9, 17, 14] such as growth-mindset [3] messages.

Empathic feedback messages could make learners' interactions with ITS more natural and effective, but they also increase the complexity of designing the ITS and its control policy, i.e., how it acts at each moment. Moreover, if feedback is given injudiciously, it could become distracting and suppress learning [6]. While affect-aware ITS with empathic feedback have demonstrated some notable success [10, 23, 2], the sum of evidence of their benefit is unclear. Empathic feedback has often been evaluated as part of a treatment condition in which the feedback was not the only variable being manipulated [16, 2]. Moreover, optimistic hypothesis testing that did not account for multiple hypotheses was often used.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

In this paper we investigate the instantaneous impact of ITS feedback on each student's emotional state. The context of our study is an iPad-based system for cognitive skills training [11], specifically a task called "Set" (similar to the classic card game) in which the participants must reason about different dimensions (size, color, shape) of the shapes shown on the cards in order to score a point. The participants are African-American undergraduate students at a Historically Black College/University (HBCU). As measures of emotion, we consider facial expression, heart rate, and heart rate variability, all of which can be estimated automatically, in real time, and with a high temporal resolution.

We examine the following research questions: Is there an instantaneous change in facial expression and/or heart rate after each ITS feedback message that is consistent across the participants? Does the evidence for such a change persist even after taking possible confounds into account? Is there evidence that at least some participants may exhibit a relationship between the sensor readings and the prompts, even if not all of them do? Finally, is there evidence of any non-emotional change in students' behavior as a result of the feedback messages?

2 Related Work

Empathic Virtual Agents: In [17], the researchers compared an "empathetic" avatar to a "non-empathetic" one. At the start of the experiment, the empathetic avatar would ask the user, "Hopefully, you will get more comfortable as we go along. Before we start, could I please have some of your information?" with the goal of building trust and comforting the participant. In contrast, the non-empathetic one would simply ask, "Have you participated in similar tests before?" They found that the empathetic agent performed no better than the non-empathetic agent in terms of changing students' self-reported mood after the intervention. However, the questionnaire results did show that participants found the empathetic avatar to be more "enjoyable, caring, trustworthy, and likeable". In another study on virtual agents [19], the researchers compared an "empathic" virtual therapist with a "neutral" one. The empathic therapist was designed to respond to the participant "in a caring manner". For instance, at the start of the session, it would say, "I'm very happy to meet you and hope you'll find our session together worthwhile. Please make yourself comfortable," whereas the neutral therapist would say simply, "Hello, I am Effie a virtual human." The study found that the empathic therapist was beneficial, relative to the neutral therapist, only for a subset of participants; this is reminiscent of the study in [6], which found that an emotionally adaptive ITS helped only students with less prior knowledge. Moreover, the benefit of the empathic therapist did not persist after the first meeting between the participant and the agent.

Empathic ITS: In [20], the researchers assessed the impact of ITS empathic feedback on students' emotions by manually coding students' facial expressions (frustration, confusion, flow, etc.). They found that there was a difference, in terms of the transition dynamics of students' affective states (e.g., flow to boredom), between the feedback messages that were rated as "high-quality" versus "low-quality" by the students. The authors of [9] compared different types of ITS feedback (epistemic, neutral, and emotional) in terms of their impact on facial emotions. The epistemic feedback was more impactful than the emotional feedback in their study. However, their study did not include a comparison against giving no feedback at all. In [14], feedback of different types (growth mindset, empathy, and success/failure) was compared in terms of students' subsequent self-reported emotions. Their results suggest that the different feedback conditions were associated with different emotions (interest, excitement, frustration, etc.). Widmer [26] employed a Wizard-of-Oz experimental design similar to ours to assess the benefit of prompts in ITS; they measured the impact on learning but not on students' emotions.

Multiple Hypotheses: Most prior studies on ITS feedback messages tested many hypotheses but did not statistically correct for this. It is thus possible that they were overly optimistic when identifying possible impacts.

3 Sensors of Emotion and Stress

In our work we investigate the impact of ITS feedback on emotion as it is expressed by facial muscle movements and changes in heart rate.

Heart Rate: Heart rate (HR) and heart rate variability (HRV) are well known and widely used as biomarkers of stress [24, 4, 18, 8]. To measure HR and HRV, we use a Polar heart monitor chest belt that is connected wirelessly to a laptop to record the inter-beat interval (IBI) of heartbeats. We measure HR as the inverse of the IBI and HRV as the standard deviation of the IBI.
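As an illustration of these two definitions, the following minimal sketch computes both quantities from a list of inter-beat intervals. It assumes the IBIs are given in milliseconds and that HR is reported in beats per minute; the function names and the choice of the population standard deviation are our own assumptions for illustration, not details taken from the study.

```python
from statistics import pstdev


def instantaneous_hr_bpm(ibi_ms):
    """Heart rate per beat, as the inverse of each inter-beat interval.

    Assumes IBIs in milliseconds; 60000 ms per minute converts to beats/min.
    """
    return [60000.0 / ibi for ibi in ibi_ms]


def hrv_ms(ibi_ms):
    """Heart rate variability as the standard deviation of the IBIs (ms).

    The population standard deviation is an assumption made for this sketch.
    """
    return pstdev(ibi_ms)


# Hypothetical IBI series (ms) for a short time window.
example_ibi = [800, 820, 780, 800]
print(instantaneous_hr_bpm(example_ibi))
print(hrv_ms(example_ibi))
```

In practice the IBI series would first be restricted to the time window of interest before either quantity is computed.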

Facial Expression: Behavioral and medical science researchers have used facial expression as a way of assessing various mental states such as engagement [25], driver drowsiness [5], thermal comfort [12], and students' emotional states in ITS [21]. Facial expression sensor toolkits are now also used in several prominent intelligent tutoring systems [13, 23, 10]. In particular, we use the Emotient SDK from iMotions, which can recognize 20 Facial Action Units (1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 18, 20, 23, 24, 25, 26, 28, 43) [7] and 12 emotions (anger, joy, sadness, neutral, contempt, surprise, fear, disgust, confusion, frustration, positive sentiment, negative sentiment). For each video frame in which a face is detected, the Emotient SDK provides a numeric value for each facial expression.

4 Dataset

In our analysis, we examined the HBCU2012 dataset [22], which is an extension of the HBCU dataset from [25]. In HBCU2012, n = 36 African-American undergraduate students interacted with iPad-based cognitive skills training software that is designed to strengthen basic cognitive processes such as working memory and logical reasoning. While they interacted with the software, their facial expressions and heart rate were recorded (see Figure 1). Each participant interacted with the ITS for 3-4 sessions, resulting in a total of 108 videos.

[Figure 1 (image not recovered). Recoverable labels: participant/subject with iPad, facial expressions from webcam, heart monitor; a timeline of the sensor measurement over time showing a "Yay"/point-scored event followed after ΔT_{y→gj} by "Good Job!" at T1, a no-event timepoint T2, a "Yay"/point-scored event followed after ΔT_{y→gj} by a no-event timepoint T3, and analysis windows of length W.]

Procedure: Each student participated in 3-4 sessions, and each daily session lasted about 40 minutes. Although the system contains several tasks, the main task is called Set, which is similar to the classic card game. In this task, the player scores a point if they correctly group 3 cards together that have a correct configuration of size, shape, and color. When the student scores a point, the software automatically issues a "Yay!" sound. The Set task is highly demanding, particularly at the advanced difficulty levels and given the time pressure. At the start of each daily session, the participant takes a 3-minute pretest. Then, they undergo 30 minutes of cognitive skills training that is facilitated by the system. In particular, the tutor decides the difficulty level at which the student practices, when to switch tasks to take a break, etc. The tutor also issues hints and feedback messages of different types (described below); these are given only during the practice session, not during the tests. After the practice session, the participant takes a posttest.

Types of feedback: The tutor can issue feedback messages of various types (see Table 1). Some of them are empathetic, some are motivational, and some are goal-oriented. Note that each message type may be expressed with slightly different phrasing, e.g., "Good Job" might be spoken by the tutor as "Good Job" or just "Good"; "Try harder" can be expressed as either "It seems like you are not trying. Please try your hardest." or "Try harder."

Human-assisted ITS: While in many aspects the cognitive skills training software used to collect the HBCU2012 dataset was automated, the decisions of when to issue feedback messages were made by a human tutor (sometimes called the trainer in a cognitive skills training regime) who was either in another room (Wizard-of-Oz style) or in the same room (1-on-1 style) as the participant. For the Wizard-of-Oz setting, the trainer could watch the student's face via a live webcam and also observe the student's practice on a real-time synchronized iPad. Compared with a fully automated ITS, this human-assisted apparatus might actually yield feedback messages that are more appropriately timed and chosen than what an ITS would decide.

Sensor Measurements and Synchronization: Each participant completed the cognitive skills testing and training on an iPad. The inter-beat interval (IBI) of heartbeats was recorded using a Polar heart monitor. Facial expressions were estimated in each frame (30 Hz) of video recorded by a webcam connected to a laptop. The game log was recorded wirelessly from the iPad onto the laptop. Game log, heart rate, and facial expression events were synchronized by finding a common timepoint between the face video and game log.

Since all participants received multiple feedback messages, we used a within-subjects design. To assess whether the various messages were associated with any immediate change in students' emotions (see Figure 1), we measured the change in the average value of a specific sensor (heart rate, heart rate variability, or one of the 20 AUs + 12 emotions) around the time (T1) when a specific message was issued. Specifically, we computed the average sensor value within a time window of length W/2 just after T1 and subtracted the corresponding average sensor value in the time window of length W/2 just before T1; this yields Δv. These values, at different times T1, constitute the treatment group of our study. Then, we computed the difference Δv (after minus before) at a random timepoint (T2) in the participant's time series that was not within 10 seconds of any prompt. These values, at different times T2, constitute the control group. By comparing Δv for the treatment vs. the control group, we can estimate the effect of the feedback message on the change in the sensor value. While this is not a truly causal inference approach, our methodology does eliminate the confound that could arise, for example, if the average sensor value tended to increase (or decrease) over time, e.g., due to fatigue.
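The before/after comparison and the choice of control timepoints described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function names, the representation of a sensor stream as parallel lists of timestamps (in seconds) and values, and the rejection-sampling strategy for control timepoints are hypothetical and are not taken from the study's actual implementation.

```python
import random


def window_mean(times, values, start, end):
    """Average of the samples whose timestamp t satisfies start <= t < end."""
    selected = [v for t, v in zip(times, values) if start <= t < end]
    return sum(selected) / len(selected) if selected else None


def delta_v(times, values, t_event, window):
    """Mean sensor value in the W/2 after t_event minus the W/2 before it."""
    half = window / 2.0
    before = window_mean(times, values, t_event - half, t_event)
    after = window_mean(times, values, t_event, t_event + half)
    if before is None or after is None:
        return None  # e.g., no face detected in one of the half-windows
    return after - before


def sample_control_times(event_times, t_start, t_end, count,
                         gap=10.0, seed=0, max_tries=10000):
    """Draw random timepoints that are at least `gap` seconds from any event."""
    rng = random.Random(seed)
    controls = []
    tries = 0
    while len(controls) < count and tries < max_tries:
        tries += 1
        t = rng.uniform(t_start, t_end)
        if all(abs(t - e) >= gap for e in event_times):
            controls.append(t)
    return controls


# Hypothetical 1 Hz sensor stream that steps from 0 to 1 at t = 5 s.
times = list(range(10))
values = [0] * 5 + [1] * 5
print(delta_v(times, values, t_event=5, window=10))
```

Applying `delta_v` at message times gives the treatment values, and applying it at the sampled control times gives the control values.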

Repeated Measures Design: Since we have multiple feedback messages and multiple days of participation for each student in our study, we use a repeated-measures design based on a linear mixed-effects model, where the student ID is a random effect. We then assess whether the presence (1) or absence (0) of the feedback message is statistically significantly related to the change Δv in a specific sensor value (facial expression or heart rate value). We repeat this for all message types and sensor values.
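One common way to write such a model is shown below. This is a sketch that assumes a random intercept per student; the text above states only that the student ID is a random effect, so the exact random-effects structure is our assumption. Here i indexes students, j indexes the time windows of student i, and x_{ij} is 1 if window j is centered on the feedback message and 0 if it is a control window.

```latex
\Delta v_{ij} = \beta_0 + \beta_1 x_{ij} + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Under this formulation, the test of interest is whether the fixed-effect coefficient \beta_1 differs from zero.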

Hypothesis Correction: Because we test many hypotheses (different messages and sensor measurements) that are largely independent of each other, and because we lack a strong prior belief that a relationship exists between any particular sensor and feedback message, we take a conservative approach and apply Bonferroni correction to the p-values: instead of the traditional α = 0.05 threshold, we use the threshold α = 0.05/m, where m is the total number of hypotheses.
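The correction amounts to a one-line computation, sketched here for concreteness. The function names are hypothetical, and the example p-values are made up for illustration; m = 260 is the number of hypotheses used for the facial expression analysis in this paper.

```python
def bonferroni_threshold(alpha, num_hypotheses):
    """Per-hypothesis significance threshold after Bonferroni correction."""
    return alpha / num_hypotheses


def is_significant(p_value, num_hypotheses, alpha=0.05):
    """True if p_value falls below the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, num_hypotheses)


# Made-up p-values, tested against the corrected threshold for m = 260.
for p in (0.0001, 0.001, 0.04):
    print(p, is_significant(p, num_hypotheses=260))
```

A p-value of 0.04, which would pass the uncorrected 0.05 threshold, does not pass the corrected one.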

Effect Size: We quantified the effect size in two ways, both of which are a form of Cohen's d statistic: (1) Global effect size: we divided the fixed-effect model coefficient for the treatment by the standard deviation of the sensor value (e.g., happiness value) over the entire dataset (all participants, all days, and all times). (2) Local effect size: we divided the fixed-effect model coefficient for the treatment by the standard deviation of all Δv in the union of the treatment and control groups. This expresses whether the change due to the feedback message is large compared to changes that occur in other time windows of length W.
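The two effect sizes differ only in the denominator, as the following sketch illustrates. The function and argument names are hypothetical, the treatment coefficient is assumed to have been obtained already from the fitted mixed-effects model, and the use of the population standard deviation is an assumption of this sketch.

```python
from statistics import pstdev


def global_effect_size(treatment_coef, all_sensor_values):
    """Coefficient divided by the SD of the raw sensor value over the dataset."""
    return treatment_coef / pstdev(all_sensor_values)


def local_effect_size(treatment_coef, treatment_deltas, control_deltas):
    """Coefficient divided by the SD of all before/after changes (both groups)."""
    pooled = list(treatment_deltas) + list(control_deltas)
    return treatment_coef / pstdev(pooled)


# Made-up numbers purely for illustration.
print(global_effect_size(0.5, [0.0, 2.0]))
print(local_effect_size(1.0, [1.0, 3.0], [1.0, 3.0]))
```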

6 Analysis

6.1 Facial Expression

Analysis Details: We followed the methodology described above, picking 20 timepoints (T2) per video for the control group such that no other event occurs within 10s before or after them. For the time window W, we used 5s and 10s. We allowed for the possibility that the participants' reactions to the ITS feedback messages might be slightly delayed; hence, we conducted analyses with a "right-shift" parameter of either 0s or 1s. Finally, for the number of hypotheses m by which we corrected the p-value threshold α, we considered that the 12 emotions (happy, sad, angry, etc.) can be considered combinations of individual Facial Action Units (AUs) [7] and are thus not independent of the 20 AUs we already measure. Since there are 13 different ITS feedback messages that we consider, we thus let m = 13 × 20 = 260, so that our threshold for statistical significance by Bonferroni correction is 0.05/260.

Results: Only 2 of the 13 feedback messages showed any statistically significant impact, after p-value correction, on any of the 32 facial expressions for any of the right-shift values (0s, 1s) or window sizes (5s, 10s). The two message types were "Great Job" and "Good Job", and the effects were significant across all combinations of window size W and right-shift value. Table 2 shows the facial expression values that have a significant change due to these feedback messages. Note that the effect sizes are generally quite small, especially when assessed at a global level (i.e., relative to the variability of the expression value over the whole dataset). The largest absolute effect size is for AU43 (closing of the eyes) for both "Good Job" and "Great Job", whereby the participants' eyes tend to be more closed before than after the message.

6.2 Heart Rate

Analysis Details: We varied W over 5s and 10s, and the trends were the same. For Bonferroni correction, we let m = 26 since we considered two different heart measures (HR, HRV) and there were 13 different message types.

Results: None of the prompts showed a statistically significant impact on HR or HRV.

7 Effects on Individual Students

Here we consider the hypothesis that the feedback messages may affect some students but not others. In particular, we test, for each combination of participant, feedback message, and sensor measurement, whether there is a statistically significant difference within each student in the average sensor value in the W/2 seconds after vs. before the prompt. For each combination of prompt and sensor value, we then calculate the fraction of students for which the difference is statistically significant. Importantly, this analysis allows for a different effect (some positive, some negative) on each student.

Facial Expression: We perform the analysis for W = 10s. For each student, if any of the 32 facial expression values changed significantly around a feedback message, we increment our count for that message type. We let m (number of hypotheses) be 20 (the number of unique Facial Action Units we measure), and hence α = 0.05/m = 2.5e-03. The results show that for most messages, fewer than one quarter of the students showed any effect; only "Great Job" (18/36) and "Good Job" (19/36) affected at least half of the students.
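The counting step can be sketched as follows for a single message type. The sketch assumes that a per-student p-value has already been computed for each facial expression value by some within-student test (the specific test is not stated here, so none is implemented); the function name, the dictionary layout, and the example p-values are hypothetical.

```python
def count_affected_students(p_values_by_student, num_hypotheses=20, alpha=0.05):
    """Count students with at least one sensor p-value below alpha / m.

    p_values_by_student maps a student ID to the list of p-values obtained
    for that student (one per facial expression value) for one message type.
    Returns (number of affected students, total number of students).
    """
    threshold = alpha / num_hypotheses
    affected = [
        student
        for student, p_values in p_values_by_student.items()
        if any(p < threshold for p in p_values)
    ]
    return len(affected), len(p_values_by_student)


# Made-up p-values for three hypothetical students.
example = {"s1": [0.001, 0.5], "s2": [0.01, 0.2], "s3": [0.0024]}
print(count_affected_students(example))
```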

Heart Rate: We varied W over 5s and 10s, and the trends were similar. For Bonferroni correction for each participant, we let m = 2 since we considered two different heart measures (HR and HRV). The trend is similar to that for the facial expression measures ("Great Job": 16/36; "Good Job": 19/36).

8 Impact of "Great Job" and "Good Job" Messages

Our analyses have found robust (over multiple sensor measurements, right-shifts, and window sizes) evidence of a relationship between the "Great Job" and "Good Job" messages and facial expression (but not heart rate), despite the conservative Bonferroni correction. However, there was little evidence in support of any other feedback message. Given that these two message types almost always occur shortly after the student has scored a point, we explored whether the change is due to the feedback itself or simply to the student having scored a point. To examine this, we modify the methodology from Section 5 so that the control group for these messages is taken at times T3 that are ΔT_{y→gj} after a "Yay"/point-scored timepoint but where no such feedback occurs (see Figure 1). Importantly, the decision of whether or not "Good Job"/"Great Job" was given was at the discretion of the human trainer and was essentially random (i.e., this is a quasi-experimental analysis). This allows us to isolate the effect of the feedback itself, rather than of the preceding "Yay" sound. We estimated the value of ΔT_{y→gj} over all the "Great Job" and "Good Job" messages in our dataset (around 1.091s).
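The construction of this matched control group can be sketched as follows. The function names are hypothetical; estimating the delay as the mean gap between each feedback message and the most recent preceding "Yay", and the 5-second exclusion radius around feedback messages, are assumptions made for this illustration rather than details confirmed by the study.

```python
def mean_yay_to_feedback_delay(yay_times, feedback_times):
    """Mean gap between each feedback message and the latest preceding "Yay"."""
    delays = []
    for f in feedback_times:
        preceding = [y for y in yay_times if y <= f]
        if preceding:
            delays.append(f - max(preceding))
    return sum(delays) / len(delays) if delays else None


def matched_control_times(yay_times, feedback_times, delay, exclusion=5.0):
    """Timepoints T3 = "Yay" time + delay at which no feedback message occurred."""
    controls = []
    for yay in yay_times:
        t3 = yay + delay
        # Keep T3 only if no "Good Job"/"Great Job" message is close to it.
        if all(abs(t3 - f) > exclusion for f in feedback_times):
            controls.append(t3)
    return controls


# Hypothetical event times in seconds.
yays = [10.0, 30.0, 50.0]
feedback = [11.0, 51.2]
delay = mean_yay_to_feedback_delay(yays, feedback)
print(delay, matched_control_times(yays, feedback, delay))
```

In this made-up example, only the second "Yay" is not followed by a feedback message, so it yields the single control timepoint.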

Analysis Details: We selected "Great Job" and "Good Job" timepoints T1 such that there is no other message, except a "Yay", within 5 seconds before or after. We also randomly selected a similar number of timepoints for T3. We varied W as 5s or 10s, and we let the right-shift be 0s or 1s. Since there are now just 2 feedback messages and 20 AUs, we let m = 40.

Results: After accounting for the preceding "Yay"/point scored as described above, we find no statistically significant change of any facial expression before vs. after the "Great Job" or "Good Job" message, for any W or right-shift value. This indicates that the change in facial expression around these messages is likely due to having scored a point, not the feedback itself.

9 Conclusions

Our analyses of facial expression and heart rate data from 36 African-American students interacting with iPad-based cognitive skills training software suggest the following: (1) The impact of the short empathic feedback messages on students' emotions was very small. (2) Several of the correlations (for "Good Job" and "Great Job") disappeared after we accounted for the confound that the student's own achievement at having scored a point could explain the impact. (3) When examining the emotional impact on individual students, we found that, except for "Great Job" and "Good Job", only a modest fraction of students showed any statistically significant correlation. Therefore, before trying to optimize an empathic ITS's control policy, it may be worth verifying that the feedback messages have any impact at all. On the other hand, and more optimistically, contemporary emotion recognition systems also offer a pathway forward to measure the impact of the ITS's actions more precisely. Finally, we note that there could be non-emotional effects of the ITS prompts on students' behaviors. For instance, when watching some videos, we noticed that a few participants shifted their eye gaze in response to the "Watch your time" prompt. Future work can explore this issue.

Acknowledgments: This research was supported by the NSF National AI Institute for Student-AI Teaming (iSAT) under grant DRL 2019805, and also by an NSF Cyberlearning grant 1822768. The opinions expressed are those of the authors and do not represent views of the NSF.

1. Andallaza, T.C.S., Jimenez, R.J.M.: Design of an affective agent for aplusix. Undergraduate thesis, Ateneo de Manila University, Quezon City (2012)

2. Arroyo, I., Woolf, B.P., Cooper, D.G., Burleson, W., Muldner, K.: The impact of animated pedagogical agents on girls' and boys' emotions, attitudes, behaviors and learning. In: International Conference on Advanced Learning Technologies (2011)

3. Claro, S., Paunesku, D., Dweck, C.S.: Growth mindset tempers the effects of poverty on academic achievement. Proceedings of the National Academy of Sciences 113(31), 8664-8668 (2016)

4. De Manzano, O., Theorell, T., Harmat, L., Ullen, F.: The psychophysiology of flow during piano playing. Emotion 10(3), 301 (2010)

5. Dwivedi, K., Biswaranjan, K., Sethi, A.: Drowsy driver detection using representation learning. In: International advanced computing conference (IACC) (2014)

6. D'Mello, S., Lehman, B., Sullins, J., Daigle, R., Combs, R., Vogt, K., Perkins, L., Graesser, A.: A time for emoting: When affect-sensitivity is and isn't effective at promoting deep learning. In: Intelligent tutoring systems (2010)

7. Ekman, R.: What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford UP, USA (1997)

8. Feidakis, M., Daradoumis, T., Caballe, S.: Emotion measurement in intelligent tutoring systems: what, when and how to measure. In: International Conference on Intelligent Networking and Collaborative Systems (2011)

9. Feng, S., Stewart, J., Clewley, D., Graesser, A.C.: Emotional, epistemic, and neutral feedback in autotutor trialogues to improve reading comprehension. In: International Conference on Artificial Intelligence in Education. Springer (2015)

10. Graesser, A., McDaniel, B., Chipman, P., Witherspoon, A., D'Mello, S., Gholson, B.: Detection of emotions during learning with autotutor. In: Proceedings of the 28th annual meetings of the cognitive science society. pp. 285-290. Citeseer (2006)

11. Hill, O.W., Serpell, Z., Faison, M.O.: The efficacy of the learningrx cognitive training program: modality and transfer effects. The Journal of Experimental Education 84(3), 600-620 (2016)

12. Jiang, H., Iandoli, M., Van Dessel, S., Liu, S., Whitehill, J.: Measuring students' thermal comfort and its impact on learning. Educational Data Mining (2019)

13. Joshi, A., Allessio, D., Magee, J., Whitehill, J., Arroyo, I., Woolf, B., Sclaroff, S., Betke, M.: Affect-driven learning outcomes prediction in intelligent tutoring systems. In: Automatic Face & Gesture Recognition (2019)

14. Karumbaiah, S., Lizarralde, R., Allessio, D., Woolf, B., Arroyo, I., Wixon, N.: Addressing student behavior and affect with empathy and growth mindset. Educational Data Mining (2017)

15. Lan, A.S., Botelho, A., Karumbaiah, S., Baker, R.S., Heffernan, N.: Accurate and interpretable sensor-free affect detectors via monotonic neural networks. In: International Conference on Learning Analytics & Knowledge (2020)

16. Mondragon, A.L., Nkambou, R., Poirier, P.: Evaluating the effectiveness of an affective tutoring agent in specialized education. In: European conference on technology enhanced learning. pp. 446-452. Springer (2016)

17. Nguyen, H., Masthoff, J.: Designing empathic computers: the effect of multimodal empathic feedback using animated agent. In: Proceedings of the 4th international conference on persuasive technology. pp. 1-9 (2009)

18. Pham, P., Wang, J.: Attentivelearner: improving mobile mooc learning via implicit heart rate tracking. In: International conference on artificial intelligence in education. pp. 367-376. Springer (2015)

19. Ranjbartabar, H., Richards, D., Bilgin, A., Kutay, C.: First impressions count! the role of the human's emotional state on rapport established with an empathic versus neutral virtual therapist. IEEE transactions on affective computing (2019)

20. Robison, J., McQuiggan, S., Lester, J.: Evaluating the consequences of affective feedback in intelligent tutoring systems. In: Affective computing and intelligent interaction and workshops (2009)

21. Sarrafzadeh, A., Hosseini, H.G., Fan, C., Overmyer, S.P.: Facial expression analysis for estimating learner's emotional state in intelligent tutoring systems. In: International Conference on Advanced Technologies (2003)

22. Saulter, L., Thomas, K., Lin, Y., Whitehill, J., Serpell, Z.: Detecting affect over four days of cognitive training. Poster presented at the Temporal Dynamics of Learning Center All-Hands Meeting at UCSD (2013)

23. Sawyer, R., Smith, A., Rowe, J., Azevedo, R., Lester, J.: Enhancing student models in game-based learning with facial expression recognition. In: User modeling, adaptation and personalization (2017)

24. Thayer, J.F., Ahs, F., Fredrikson, M., Sollers III, J.J., Wager, T.D.: A meta-analysis of heart rate variability and neuroimaging studies: implications for heart rate variability as a marker of stress and health. Neuroscience & Biobehavioral Reviews 36(2), 747-756 (2012)

25. Whitehill, J., Serpell, Z., Lin, Y.C., Foster, A., Movellan, J.R.: The faces of engagement: Automatic recognition of student engagement from facial expressions. IEEE Transactions on Affective Computing 5(1), 86-98 (2014)

26. Widmer, C.L.: Examining the Impact of Dialogue Moves in Tutor-Learner Discourse Using a Wizard of Oz Technique. Ph.D. thesis, Miami University (2017)