Understanding the Impact of Unexpected Cobot Movements on Human Stress Levels: A Time Series Classification Task

Lior Shilon1,*,†, Vladyslav Shevtsov1,*,†, Nadia Schillreff2, Anne Rother2, Shlomo Mark1 and Myra Spiliopoulou2
1 Department of Software Engineering, Sami Shamoon College of Engineering, Ashdod, Israel
2 Faculty of Computer Science, Otto-von-Guericke-University, Magdeburg, Germany

Abstract
This study investigates the impact of unexpected stimuli on participants' stress levels during human-robot interaction (HRI). We designed an experiment in which a cobot performs writing tasks, as well as some unexpected actions intended to surprise the human observer. During this experiment, we continuously monitor physiological responses through Electrodermal Activity (EDA) and Electrocardiogram (ECG) sensors. Our goal was to capture the relationship between surprising actions by the cobot and indicators of change in human stress levels. We employed time series classification techniques, namely ROCKET, Shapelet Transform and WEASEL, along with Support Vector Machine (SVM) analysis on temporal segments. This approach enabled us to develop a binary classification model on stress levels. Our research contributes to the design of AI systems that can detect changes in the stress levels of humans who interact with cobots that perform unexpected actions. Thus, we work towards the improvement of safety protocols in HRI.

Keywords
Human-Robot Interaction, Emotions in Human-Robot Interaction, Collaborative Robots, Time Series Classification

1. Introduction and Related Work

One of the major safety concerns in HRI is the potential for human-robot collisions, which can cause significant injuries; even cobots, when equipped with a sharp tool, are no longer inherently safe. Rapid, unexpected movements by robots can lead to involuntary human responses, such as startle or surprise, increasing the likelihood of such collisions.
Recognizing changes in stress levels is crucial for improving safety and efficiency in HRI. Understanding these emotional and physiological responses is key to designing safer and more efficient robot trajectories. Previous research has explored various aspects of HRI, particularly focusing on the detection and management of human emotions and physiological responses. Kirschner et al. [1] examined the effect of interactive user training on reducing human startle-surprise motion, highlighting the importance of managing unexpected movements to prevent involuntary motions. This study demonstrated that hands-on training could significantly reduce involuntary motions, thereby enhancing safety in HRI. Our work is inspired by the need to recognize startle-surprise even if it does not result in motion. As a first step in this direction, we conducted an experiment where the cobot does something surprising, and we analyzed the stress levels of the human participant (one per experiment) before and after the surprising cobot motion. Understanding the impact of unexpected stimuli on stress levels during human-robot interaction (HRI) is crucial for designing safer and more effective collaborative robots (cobots).

HAII5.0: Embracing Human-Aware AI in Industry 5.0, at ECAI 2024, 19 October 2024, Santiago de Compostela, Spain.
* Corresponding authors.
† These authors contributed equally.
liorsh9@ac.sce.ac.il (L. Shilon); vladish7@ac.sce.ac.il (V. Shevtsov); nadia.schillreff@ovgu.de (N. Schillreff); anne.rother@ovgu.de (A. Rother); marks@sce.ac.il (S. Mark); myra@ovgu.de (M. Spiliopoulou)
ORCID: 0000-0002-6768-5871 (A. Rother); 0000-0002-2484-3542 (S. Mark); 0000-0002-1828-5759 (M. Spiliopoulou)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.
Previous research has established that unexpected events can induce stress, which can be measured through physiological responses such as EDA and ECG signals [2, 3]. However, it is important to distinguish between general physiological arousal and specific stress responses. In this study, we assume that interruptions by the cobot lead to increased physiological arousal, which we interpret as stress within the experimental context. It is important to note that our findings do not map one-to-one to physiological responses and that the ECG data had to be discarded due to poor contact quality. According to Klimek et al. [2], wearable EDA sensors are promising tools for detecting perceived stress, with field studies showing an average accuracy of 82.6% in predicting stress. Alonso-Martín et al. [4] presented a multi-modal emotion detection system that integrates voice and facial expression analysis to detect user emotions during HRI. The system, comprising GEVA (Gender and Emotion Voice Analysis) and GEFA (Gender and Emotion Facial Analysis), demonstrated high accuracy in emotion detection by combining the outputs of both channels through a decision rule. This multi-modal approach significantly improved detection rates compared to using individual channels. Inspired by this approach, we selected two physiological sensors, EDA and ECG, to capture human stress responses during HRI. However, the impact of surprising cobot motion on human stress levels is less investigated, and few studies have focused on the specific impact of surprise elements introduced by robots. Therefore, in this work we pursue the following research question: how can we model and detect change in a human's stress levels, as caused by a cobot's surprising actions, in a within-lab experimental setting?
To address this question, we capture physiological data of the human, namely the Electrodermal Activity (EDA) signal, and monitor the cobot's activity. We propose a workflow that partitions the physiological time series into segments containing the human's EDA and ECG recordings during a surprising action of the cobot, and segments of time where no such action took place. Then, we deploy time series classification for segment separation and study the quality of the classifier. We postulate that a high-quality classification would indicate a change in stress levels, which in turn would be associated with the only surprising event that occurs during the experiment, namely the cobot's surprising actions.

2. Experimentally Capturing Human Surprise in HRI

The primary aim of this experiment is to ascertain whether the proximity of a robotic arm engaged in writing, with the potential for unforeseen actions, exerts an influence on the stress levels of individuals. This influence is quantified using EDA and ECG sensors. The collected data are then used for stress level classification.

Designing HRI systems should balance surprise and predictability to maintain both engagement and trust. In our experiment, we achieved this balance by having the robot perform predictable writing tasks for most of the time, while incorporating short sequences with unexpected, i.e. surprising, actions performed by the cobot. Such an action had the form of a sudden interruption of the cobot's writing activity and a movement of the cobot arm in another direction. Hereafter, we term this activity a 'surprising action' (of the cobot). This design allowed us to measure the impact of unexpected actions on stress levels, while also considering the broader implications on engagement and trust as suggested by Law et al. [5].

2.1. Assumption of Stress Induction

We operate under the assumption that surprising actions of the cobot serve as indicators for time segments with change in stress levels.
This assumption is based on established links between unexpected events and stress responses, as indicated by the prior research of Kirschner et al. [1]. Specifically, we expect that segments containing surprising actions will correspond to higher stress levels, as inferred from EDA signals. EDA has been shown to be a reliable indicator of stress, with wearable sensors demonstrating high accuracy in predicting perceived stress levels in various studies [2].

2.2. Measurement of Physiological Arousal

EDA is widely used to measure physiological arousal, which can indicate stress among other responses. Increased EDA reflects heightened sympathetic nervous system activity, often associated with stress. Studies have confirmed that EDA is an effective measure of stress, with reported accuracies between 42% and 100%, averaging 82.6% [2]. Although EDA measures general arousal and not exclusively stress, we label each segment containing a surprising action as 'positive' in the sense of being likely to contain an increase in stress level, and a segment containing no surprising action as 'negative'. Additionally, ECG has been shown to be a reliable tool for measuring stress [3]. However, we discarded the ECG data altogether due to poor sensor contact, which compromised data quality.

2.3. Experiment Design

2.3.1. Setup

The experiments were conducted using the KUKA LBR iiwa robot1. We created and 3D-printed a tool that attaches to the robot's flange and holds the writing implement (pen) used by the robot to write. We then programmed the robot to transcribe the input sentences onto a sheet of paper placed on a table in front of the human participant.

1 https://www.kuka.com/en-de/products/robot-systems/industrial-robots/lbr-iiwa ("KUKA" is a registered trademark of KUKA Deutschland GmbH.)

2.3.2. Types of surprising actions

To model surprise, five different types of actions were designed:

1. The robot stops writing mid-sentence.
2. The robot returns to the starting position.
3. The robot makes a sudden movement towards the participant.
4. Small spline movement (a small movement in space of the robotic arm, close to the base of the arm).
5. Big spline movement (a bigger movement that reaches out further towards the participant).

The first two actions were designed by Participant 1, the third by Participant 2, and the last two by a third party who did not take part in the experiment as a participant. Actions 1 and 2 are non-threatening movements (the robot is not moving towards the human), while actions 3, 4, and 5 differ in the amount of movement (for example, action 5 uses more of the workspace for the movement). We opted for this approach to minimize exposure to the surprises as much as possible, given our dual roles as both the experiment's programmers and participants.

To measure the impact of the cobot's actions on human stress levels, Electrodermal Activity (EDA) and Electrocardiogram (ECG) sensors are utilized. These sensors capture physiological responses indicative of stress and arousal levels:
• Electrodermal Activity (EDA): Measures changes in skin conductance, reflecting sympathetic nervous system activity. Increased EDA suggests heightened arousal or stress.
• Electrocardiogram (ECG): Records the electrical activity of the heart, providing insights into heart rate variability (HRV) and cardiac responses to stressors.

2.3.3. Hypothesis

We hypothesize that the EDA/ECG levels of participants will differ between the segments that contain a surprising action of the cobot and the segments that do not.

2.3.4. Procedure in the Experiment

Pre-Experiment Setup: (1) Position the robot in front of the table and secure it to prevent unwanted movements. (2) Place and secure the paper on the table using tape. (3) Attach EDA and ECG sensors to participants according to manufacturer instructions. (4) Set up cameras to record the experiment.
Note that the camera is used to identify the exact time points where each surprising action started and ended, so that the segments of the physiological signal can be marked accordingly.

Experimental Phase: (1) The robotic arm writes a text on a surface; this is visible to the participant of the experiment. (2) Each participant undergoes two experiments/sessions, with up to 6 surprising actions per experiment. Each surprising action has a 0-37% chance of occurring randomly during the writing phase, with a minimum of 5-10 movements between surprising actions. Surprising actions that have not yet occurred take precedence over those that already have.

Data Collection: (1) Record EDA and ECG readings throughout the experiment. (2) Prompt participants to rate their stress levels after each session. (3) Record any additional observations or qualitative data relevant to stress levels.

Post-Experiment Tasks: (1) Remove the sensors from the participants' hands. (2) Log the sensor data. (3) Shut down the robot.

2.4. Building Time Segments for Classification on Surprise

We segmented the recorded physiological data (EDA and ECG) into 1-second intervals. We also used the video recordings, which helped us to identify and mark in the time series the precise starting and ending point of each surprising action. For EDA data, we use a minimum segment length of 20 units, and for ECG data, 25 units, accommodating the longest observed interruption. Then, we assigned labels to the segments, as explained in subsection 2.2, where 'positive' corresponds to class 1 and 'negative' to class 0. We used these segments to train and test a time series classifier. We show the concrete EDA time series of Participant 1 in the two plots of Figure 1, and likewise the EDA time series of Participant 2 in the two plots of Figure 2. We marked the 0-segments in red and the 1-segments in green.
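The segmentation and labelling steps described in subsection 2.4 can be sketched as follows. This is a minimal illustration under our own assumptions, not the code used in the study: the function name `build_segments`, the `events` list, and the synthetic 1 Hz EDA series are all hypothetical, and the video-annotated action intervals are given as inclusive (start, end) second indices.

```python
import numpy as np

def build_segments(eda, events, min_len=20):
    """Split a 1 Hz EDA series into labelled segments.

    eda     -- 1-D array with one sample per second
    events  -- list of (start_s, end_s) intervals of surprising actions,
               taken from the video annotation (inclusive, in seconds)
    min_len -- minimum segment length in samples (20 for EDA in our setup)
    """
    labels = np.zeros(len(eda), dtype=int)
    for start, end in events:
        labels[start:end + 1] = 1              # samples inside a surprising action

    segments, seg_labels = [], []
    pos = 0
    while pos < len(eda):
        end = pos + min_len
        # extend the segment while an ongoing surprising action would be cut in half
        while end < len(eda) and labels[end - 1] == 1 and labels[end] == 1:
            end += 1
        segments.append(eda[pos:end])
        # class 1 iff the segment overlaps a surprising action, class 0 otherwise
        seg_labels.append(int(labels[pos:end].max()))
        pos = end
    return segments, seg_labels

# Synthetic two-minute recording with one surprising action from second 30 to 45
rng = np.random.default_rng(0)
eda = rng.normal(0.5, 0.05, size=120)
segments, seg_labels = build_segments(eda, events=[(30, 45)])
```

With these inputs, every segment except possibly the last has at least 20 samples, and only the segment overlapping seconds 30-45 receives class 1, mirroring the skewed class distribution of Table 1.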
The starting time point in each time series is the moment where the participant taps the sensor, whereupon the recording starts.

Figure 1: The EDA time series of Participant 1, first then second experiment: segments with surprising actions (class 1) are marked in green, segments without (class 0) are marked in red.

Figure 2: The EDA time series of Participant 2, first then second experiment: segments with surprising actions (class 1) are marked in green, segments without (class 0) are marked in red.

As can be seen in the two figures, and particularly in Figure 2, some peaks in the stress levels occur after, or even before, the time interval in which the surprising action took place: for a case with a peak well before the surprising action, refer to time point 36 of the first experiment of Participant 2. Moreover, there are segments with peaks although there is no surprising action. This can be attributed to some external event that caught the participant's attention, or to some misalignment between the video recording and the EDA recording. These artifacts were ignored during time series classification, but it is evident that they made the classification task much more difficult. Notwithstanding the aforementioned artifacts, Table 1 depicts the number of segments per participant and experiment; it is evident that the class distribution is skewed: class 1 is underrepresented. Note that we show only the EDA time series, because the ECG signals were not properly captured and had to be discarded.

2.5. Preparing the Data for Time Series Classification Algorithms

Building on the segmented data from the previous subsection, we carried out additional steps to make the data suitable for classification.
Table 1
Distribution of positive and negative segments per participant and experiment

                              # positive (1) segments   # negative (0) segments   Total
Participant 1  Experiment 1              3                        39               42
               Experiment 2              3                        40               43
Participant 2  Experiment 1              1                        39               40
               Experiment 2              2                        39               41

We computed statistical features for each segment for the SVM classifiers, including the mean, standard deviation, and median. We also tabulated the count of positive (stress response) and negative (no stress response) instances. For classification, we used several algorithms, including the ROCKET, Shapelet Transform, and WEASEL classifiers. Additionally, we trained three SVM classifiers with different kernels (linear, RBF, and sigmoid) on both the 1st participant's and the 2nd participant's experimental data, testing them across each other's datasets using the extracted features.

2.6. Time Series Classification Models

2.6.1. ROCKET

According to Dempster et al. [6], "ROCKET transforms time series using a large number of random convolutional kernels, i.e., kernels with random length, bias, weights, dilation, and padding." Essentially, ROCKET breaks the time series data into small windows, identifying the highest point (global max) and calculating the proportion of positive values (ppv) [6]. In our context, a time series is a segment.

2.6.2. WEASEL

The WEASEL algorithm, as described by Schäfer and Leser [7], uses a sliding window to extract sub-sequences, then transforms each one into a feature vector by applying Symbolic Fourier Approximation (SFA) [7]. As with ROCKET, the inputs to WEASEL are the segments.

2.6.3. Shapelet Transform

The Shapelet Transform method by Bostrom and Bagnall [8] and Hills et al. [9] randomly selects a subset of sub-sequences from the time series data as potential shapelets, thus avoiding an exhaustive search for shapelets of all possible lengths. These potential shapelets are compared to all possible sub-sequences in the data [8, 9].
For our data, these are sub-sequences of each segment, keeping in mind that segments may vary in length. Finally, features are extracted based on the similarity between each time series and the selected shapelets [8, 9].

2.7. Training and Testing

We constructed Time Series Classifier (TSC) models using the sktime library, specifically employing the ROCKET classifier (multirocket), WEASEL, and Shapelet Transform classifiers. A total of nine time series models were trained. For each algorithm, the models included: one trained solely on the 1st participant's first experiment data; one trained on both the 1st participant's first and second experiment data; and one trained on the 1st participant's first and second experiment data in addition to the 2nd participant's first experiment data. All models were evaluated using data from the 2nd participant's second experiment, which was selected due to its higher frequency of interruptions.

3. Results

Our results are depicted in Table 2, where we evaluate precision, recall and F1 towards the positive class 1. We did not include SVM in the table because every SVM model we trained predicted class 0 for all segments. It is evident that no model could cope with the extreme skew in favour of class 0: ROCKET focused on class 1 when training on Participant 1, Shapelet Transform did so only when using all data from Participant 1, while WEASEL concentrated on class 0 throughout. Adding the data of the second participant did not contribute to model quality. This indicates that oversampling of class 1, under-sampling of class 0, or both should be done before learning on such skewed data.
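The skew problem just described can be illustrated on the SVM branch of the workflow (subsection 2.5). The sketch below is ours, not the study's code, and it uses synthetic stand-in segments rather than the recorded EDA data; it extracts the mean/standard-deviation/median features and applies scikit-learn's `class_weight="balanced"` option, one possible mitigation for a classifier that would otherwise predict class 0 for every segment, as ours did.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_fscore_support

def segment_features(segments):
    """Mean, standard deviation and median per segment (cf. subsection 2.5)."""
    return np.array([[s.mean(), s.std(), np.median(s)] for s in segments])

rng = np.random.default_rng(42)

def make_segments(n_neg, n_pos):
    """Synthetic stand-in segments: 'surprise' segments get a higher signal level."""
    neg = [rng.normal(0.0, 1.0, size=20) for _ in range(n_neg)]
    pos = [rng.normal(1.5, 1.0, size=20) for _ in range(n_pos)]
    return neg + pos, np.array([0] * n_neg + [1] * n_pos)

# Class counts roughly mimicking Table 1: very few positives per experiment
train_segments, y_train = make_segments(40, 4)   # stand-in for Participant 1
test_segments, y_test = make_segments(39, 2)     # stand-in for Participant 2

X_train = segment_features(train_segments)
X_test = segment_features(test_segments)

for kernel in ("linear", "rbf", "sigmoid"):
    # class_weight="balanced" reweights the rare positive class during training
    clf = SVC(kernel=kernel, class_weight="balanced")
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, labels=[1], zero_division=0)
    print(f"{kernel}: precision={p[0]:.2f} recall={r[0]:.2f} f1={f1[0]:.2f}")
```

Class weighting rescales the misclassification penalty inversely to class frequency; resampling (over- or under-sampling) is the alternative discussed above and operates on the data rather than the loss.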
Table 2
Performance metrics for each scenario and model (positive class)

Classifier           Training scenario                                          Precision  Recall   F1
ROCKET               Experiment 1 of Participant 1                                0.29      1.00   0.44
                     Experiments 1 & 2 of Participant 1                           0.20      1.00   0.33
                     Experiments 1 & 2 of Part. 1, Experiment 1 of Part. 2        0.00      0.00   0.00
WEASEL               Experiment 1 of Participant 1                                0.00      0.00   0.00
                     Experiments 1 & 2 of Participant 1                           0.00      0.00   0.00
                     Experiments 1 & 2 of Part. 1, Experiment 1 of Part. 2        0.00      0.00   0.00
Shapelet Transform   Experiment 1 of Participant 1                                0.00      0.00   0.00
                     Experiments 1 & 2 of Participant 1                           0.50      1.00   0.67
                     Experiments 1 & 2 of Part. 1, Experiment 1 of Part. 2        0.00      0.00   0.00

4. Conclusion

We presented an experiment design to measure change in stress levels due to surprising cobot actions, and a framework for the analysis of the experimental data. We measured the electrodermal activity (EDA) of the participants, and we partitioned the EDA time series into segments that correspond or do not correspond to surprising actions (positive/negative). The analysis encompassed the separation between segments of the two classes with time series classifiers. For this, we created a workflow that allowed the comparison of models built by different state-of-the-art time series classification algorithms. For this workflow, we also used several evaluation criteria, namely precision, recall and F1 for each of the classes, although we report only the rare class here. The purpose of the analysis was to investigate to what extent it is possible to distinguish between the two types of segments by considering only EDA as a stress indicator. The hypothesis could not be tested because the number of positive segments was too small.
Furthermore, the presence of peaks in the stress levels well after or even well before the actual surprising action indicates either that the participant was distracted or that the time interval of the surprising action could not be reproduced properly from the video; this corresponds to a misalignment between the video recording and the time series of the body signal. This affected the classification negatively. Moreover, the experimental data were reduced because the ECG signal recordings indicated a sensor fault and had to be ignored. A further limitation was the small number of experiment participants. Nonetheless, the data skew was the primary limiting factor, and should be mitigated in future work through oversampling of the positive class and/or under-sampling of the negative class. While under-sampling is straightforward, oversampling of temporal segments may demand the generation of synthetic time series, which is itself challenging in the case of so few positive samples. We suspect that the duration of the surprising actions also affected the EDA recordings. In particular, surprising actions of long duration translate into long segments, where a short peak in the stress level might be misinterpreted as noise. Hence, future experiments should achieve a balance between the duration of the surprising actions and the number of such actions. Finally, a larger number of participants should be recruited. Then, it will become interesting to investigate to what extent participant-specific models can be augmented with data or models of other participants.

Acknowledgments

This research was supported by a grant from the Erasmus+ Program of the European Union, which enabled us to spend a semester at Otto von Guericke University Magdeburg during the summer. We would like to express our gratitude for their invaluable support and funding, which made this work possible. Special thanks to the Erasmus+ team for their dedication to fostering academic cooperation and innovation across Europe.
References

[1] R. Kirschner, L. Burr, P. M., H. Mayer, S. Abdolshah, S. Haddadin, Involuntary motion in human-robot interaction: Effect of interactive user training on the occurrence of human startle-surprise motion, in: 2021 IEEE International Conference on Intelligence and Safety for Robotics, 2021.
[2] A. Klimek, I. Mannheim, G. Schouten, E. J. M. Wouters, M. W. H. Peeters, Wearables measuring electrodermal activity to assess perceived stress in care: a scoping review, Acta Neuropsychiatrica (2023) 1–11. doi:10.1017/neu.2023.19.
[3] M. Naeem, S. A. Fawzi, H. Anwar, A. S. Malek, Wearable ECG systems for accurate mental stress detection: a scoping review, Journal of Public Health (2023). URL: https://doi.org/10.1007/s10389-023-02099-6. doi:10.1007/s10389-023-02099-6.
[4] F. Alonso-Martín, M. Malfaz, J. Sequeira, J. F. Gorostiza, M. A. Salichs, A multimodal emotion detection system during human–robot interaction, Sensors 13 (2013) 15549–15581. URL: https://www.mdpi.com/1424-8220/13/11/15549. doi:10.3390/s131115549.
[5] E. Law, V. Cai, Q. F. Liu, S. Sasy, J. Goh, A. Blidaru, D. Kulić, A wizard-of-oz study of curiosity in human-robot interaction, in: 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 2017, pp. 607–614. doi:10.1109/ROMAN.2017.8172365.
[6] A. Dempster, F. Petitjean, G. I. Webb, ROCKET: exceptionally fast and accurate time series classification using random convolutional kernels, Data Mining and Knowledge Discovery 34 (2020) 1454–1495. URL: https://doi.org/10.1007/s10618-020-00701-z. doi:10.1007/s10618-020-00701-z.
[7] P. Schäfer, U. Leser, Fast and accurate time series classification with WEASEL, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM '17, ACM, 2017. URL: http://dx.doi.org/10.1145/3132847.3132980. doi:10.1145/3132847.3132980.
[8] A. G. Bostrom, A. J. Bagnall, Binary shapelet transform for multiclass time series classification, Trans. Large Scale Data Knowl. Centered Syst. 32 (2015) 24–46. URL: https://api.semanticscholar.org/CorpusID:265852316.
[9] J. Hills, J. Lines, E. Baranauskas, J. Mapp, A. Bagnall, Classification of time series by shapelet transformation, Data Mining and Knowledge Discovery 28 (2013). doi:10.1007/s10618-013-0322-1.