Detecting Lies in the Wild: Creativity and Learning @ the Maker Faire Rome

Dario Pasquali1,*, Francesco Rea2 and Alessandra Sciutti1

1 COgNiTive Architectures for Collaborative Technologies (CONTACT) - Istituto Italiano di Tecnologia (IIT), Via Enrico Melen 83, Genova (Italy)
2 Robotics Brains and Cognitive Sciences (RBCS) - Istituto Italiano di Tecnologia (IIT), Via Enrico Melen 83, Genova (Italy)

CREAI 2022 - Workshop on Artificial Intelligence and Creativity, November 28 - December 2, 2022, Udine, IT
* Corresponding author: dario.pasquali@iit.it (D. Pasquali); francesco.rea@iit.it (F. Rea); alessandra.sciutti@iit.it (A. Sciutti)
ORCID: 0000-0001-8185-8188 (D. Pasquali); 0000-0001-8535-223X (F. Rea); 0000-0002-1056-3398 (A. Sciutti)

Abstract
Creativity is one of the most powerful skills humans can rely on to overcome daily challenges. While most of the research has focused on the positive facets of creativity, such as problem-solving and art, only a few contributions have explored its dark side: lying and deception. Virtual and embodied intelligent agents approaching the real world will soon face humans' deception with poor means to understand and unmask it. In a previous study, we asked participants to describe a set of gaming cards to the humanoid robot iCub, either describing what they saw or producing a creative and deceiving description; the robot autonomously classified players' behavior with a pupillometry-based heuristic method. After collecting an in-laboratory dataset, we trained a Random Forest classifier enabling the humanoid robot iCub to detect lie-related creativity autonomously during informal human-robot interactions. In this manuscript, aiming at real-world applications, we challenged our classifier in a diametrically opposite environment: the Maker Faire Rome 2022. Moreover, we compared its performance with that of an Adaptive Random Forest twin, able to learn online after each interaction. The performance of the two models and the detection of concept and data drift give relevant insight into how adaptivity could be the key to developing more effective intelligent agents.

Keywords
Lie Detection, Creativity, Human-Robot Interaction, Incremental Learning, In the wild

1. Introduction
Humans' creativity is highly subjective; from problem-solving to art production, humans rely on creativity to survive and to develop today's society. Traditionally, creativity has been related to the originality and appropriateness of people's creative products and to the ability to generate novel and effective ideas [1]. While the literature mainly focuses on the more intuitive, art-related understanding of creativity, recent studies have started exploring the creative process embedded in lying and deception [2, 3, 4, 5]. Indeed, creativity can also be used for negative purposes, with different degrees of malice [2]. Focusing on everyday social interaction, "white lies" - lies not meant to harm others - are the most widespread negative creative attempts. Everybody lies [6], with an average of two lies per day [7] or even higher frequencies - 60% of the participants in [8] lied at least once in a 10-minute dialogue. For instance, we lie to present ourselves better than we are [7], to persuade others [9], or to avoid undesired conversations [10].
Despite the high impact of lying on social interactions, humans perform poorly at recognizing liars - the average accuracy is 47% on detecting lies and 65% on recognizing true statements [11]. Several technical attempts have been developed to compensate for our poor performance and to grasp the lying creative process. State-of-the-art solutions usually monitor physiological proxies (e.g., skin conductance, respiration rate, heartbeat, blood pressure, or pupil diameter) known to reflect variations in cognitive load and emotional arousal. Indeed, it has been shown that the fabrication and maintenance of a creative and coherent lie produce an increased cognitive effort [7, 12] and emotional arousal [13] with respect to truth-telling. Traditional lie detection methods rely on fMRI images [14], skin temperature variations [15], micro-expressions [16], photoplethysmography [17], or acoustic prosody [18]. Finally, the polygraph - the most famous "truth machine", debunked in [12] - merges a multi-modal set of physiological measures [19]. However, most of these methods are invasive, expensive, not autonomous, or dependent on expert figures, making them poorly applicable in everyday social interactions.

The problem becomes even more crucial when considering the novel artificial agents that will soon be part of our society. Artificially intelligent agents, either virtual or embodied in robots, will sooner or later clash with humans' deceptive behaviors, with scarce means to comprehend them or to understand when others are trustworthy [20, 21, 22]. Robots could hardly be equipped with most of the mentioned technical solutions without requiring an expert operator or affecting the (in)formality of the social Human-Robot Interaction (HRI) (e.g., by touching the human partners to assess their skin conductance). Also, intelligent agents are usually trained in closed laboratory environments under strict and controlled context assumptions; hence, they would lack humans' ability to adapt and learn from daily experiences.

1.1. Detecting Lies in Human-Robot Interaction
This study is part of a four-year research project to enable the humanoid robot iCub [23] to detect lies in social human-robot interactions. We identified pupillometry [24, 25, 26, 27], in particular the Task Evoked Pupillary Responses (TEPRs) [28], as a minimally invasive proxy to detect deceptive behaviors. Such measures have been shown to reflect the increase in cognitive load associated with deception and lie creation with respect to truth-telling [29]. We opted for this metric because it is hardly controllable intentionally and is measurable with unobtrusive devices usually embedded in everyday objects (i.e., glasses) [30, 31]. Furthermore, recent findings suggest it will soon be feasible to measure TEPRs from standard RGB cameras [32, 33, 34] (e.g., the ones equipping robots). Such systems still require users to stay uncomfortably close to the camera; however, they make pupillometry-based lie detection a good candidate for HRI.

Based on these assumptions, we first showed that the same TEPRs studied in human-human interaction also occur in HRI during a formal interrogatory-like scenario [35, 36]. However, like other attempts in the literature [37, 38], our findings were limited to formal interrogatory contexts. Hence, we explored whether the same effects would occur in informal interactions. For this purpose, we generalized the concept of lying through its core component: creativity.
To be precise, it would be wrong to consider any creative attempt a lie; however, lying implicitly embeds a creative effort [3]. Hence, modeling humans' creative process could help intelligent systems to better grasp lying and deception during everyday social interaction. For this purpose, we asked 39 participants to play a game with the humanoid robot iCub. Players described a set of gaming cards from the Dixit Journey card game (https://boardgamegeek.com/boardgame/121288/dixit-journey) to the robot either in a descriptive way - i.e., by narrating what they saw in the card - or in a creative way - i.e., by inventing a fake card description. iCub detected players' creative attempts, autonomously and in real-time, based only on their pupil dilation, streamed from a Tobii Pro Glasses 2 eye-tracker.

More precisely, the game was composed of two phases. In the first phase, the players described six cards, creating a fake description only for one of them, which they knew in advance; iCub achieved an accuracy of 83% in detecting the single creative description by selecting the one with the highest mean pupil dilation among the six [39]. Moreover, the robot leveraged this brief interaction to learn an internal model of how the specific human partner behaved when being descriptive and creative: it stored the mean pupil dilation of the creative card description (creative reference score) and the average of the mean pupil dilations of the other descriptions (descriptive reference score). In the second phase, iCub used this model to classify further card descriptions as creative or descriptive. Players were asked to describe six new cards, deciding for each one whether to narrate what they saw or to fabricate a fake description. The robot computed the absolute distance of the mean pupil dilation of each new card description from the two reference scores and assigned the label of the closest one. Such a simple heuristic (sketched below) allowed iCub to achieve an accuracy of 73% [20]. Post hoc, we used the second-phase data - 409 datapoints re-balanced through the SMOTE algorithm [40] - to train a Random Forest classifier that detects creativity with an F1-score of 71%. This model is expected to be more generic and robust than the subjective heuristic employed during the game.
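As a minimal sketch of this two-phase heuristic, assuming the baseline-normalized mean pupil dilation of each description is already available (function names and the toy numbers below are illustrative, not the original implementation):

```python
import numpy as np

def detect_creative_card(mean_dilations):
    """Phase 1: guess the creative description as the one with the
    highest mean pupil dilation among the six cards."""
    return int(np.argmax(mean_dilations))

def build_reference_scores(mean_dilations, creative_index):
    """Store the creative reference score and the average of the
    remaining (descriptive) descriptions."""
    creative_ref = mean_dilations[creative_index]
    descriptive_ref = np.mean(
        [d for i, d in enumerate(mean_dilations) if i != creative_index]
    )
    return creative_ref, descriptive_ref

def classify_description(mean_dilation, creative_ref, descriptive_ref):
    """Phase 2: assign the label of the closest reference score."""
    if abs(mean_dilation - creative_ref) < abs(mean_dilation - descriptive_ref):
        return "creative"
    return "descriptive"

# Toy example: six baseline-normalized mean dilations from the first phase.
phase1 = [0.12, 0.15, 0.41, 0.10, 0.18, 0.14]
c_ref, d_ref = build_reference_scores(phase1, detect_creative_card(phase1))
print(classify_description(0.35, c_ref, d_ref))  # -> "creative"
```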
Aiming at applying our system in everyday life, one important question remains open: how would our creativity detection system perform "in the wild"?

1.2. Online Interactive Adaptation
Moving in-lab trained machine learning systems to a real context is non-trivial. Some potential issues are specific to the adopted solution: in our case, the Tobii eye-tracker is sensitive to environmental illumination; moreover, players' cognitive load and arousal depend on a multitude of factors (e.g., emotion, stress, presence of distractions) other than the act of being descriptive or creative. Other issues are known to affect all machine learning models ported to realistic environments: Data Drift, where the distribution of real-world data differs from the one the model was trained on; and Concept Drift, where the relation between the dependent and independent variables (i.e., the pattern learned by the model) changes [41]. In both cases, the result is a decreasing performance over time. Training the models on a more extensive and representative dataset should help, but it might not be sufficient, or even feasible, as in our lie detection scenario.

A more feasible solution to face data and concept drift is online learning. Just as students need to adapt what they learned at university once they start their first job, machine learning models should be able to learn and adapt to the real environment wherein they are employed. This is particularly true for models for social HRI. Through experience and interactive feedback from human partners, it would be possible to improve and adapt in-lab machine-learning models to everyday life.

Figure 1: (Left) The Maker Faire game setup with the real-time plot of visitors' pupil dilation on the screen, the Tobii Pro Glasses 2 eye-tracker, and Dixit Journey cards; (Right) an example of a described card.

For this purpose, we challenged our Random Forest creativity classifier and explored the effect of online interactive adaptation in a diametrically opposite context: the 10th Maker Faire Rome 2022 (https://makerfairerome.eu/en/) (see Figure 1). Visitors played a simplified version of our interactive card game, adapted as a Human-Computer Interaction demonstration. As in the laboratory, the game tried to classify players' descriptive and creative card descriptions based on pupil dilation, streamed in real-time from the Tobii eye-tracker. Two classifiers processed visitors' pupillometry in parallel:

Random Forest (RF). The model is a simplified version of the one presented in [20]. It is pre-trained on 409 card descriptions (225 truthful and 184 deceptive) collected in our in-lab experiment; for each card, the following features are computed: duration of the description; average, minimum, maximum, skewness, absolute energy, and slope of the pupil dilation. We randomly selected 75% of the data as training set, re-balanced it with the SMOTE algorithm, and performed a 4-fold grid-search cross-validation to identify the following best parameters: n_estimators=5, split_criterion=entropy, max_depth=2, max_features=log2, bootstrap=True. The best model achieved an accuracy of 71.8% and an F1-score of 70.7% on the test set. Finally, we trained a comprehensive model on the full dataset, achieving a training accuracy of 70.7% and an F1-score of 68.8%.

Adaptive Random Forest (ARF). As a comparison, we trained an Adaptive Random Forest [42] (https://riverml.xyz/dev/api/ensemble/AdaptiveRandomForestClassifier/) able both to start from pre-trained knowledge and to adapt online based on visitors' interactive feedback. We used the same dataset and features as for the RF classifier; however, we selected the following parameters: n_estimators=5, max_features=log2, split_criterion=nba, and max_depth=2 - please note that the entropy-based split criterion is not implemented in the Python library we used for the ARF. The model, learning from the in-laboratory dataset one datapoint at a time, achieved an accuracy of 68.5% and an F1-score of 62.8%. Even if this performance is worse than that of the static model, the Adaptive Random Forest can improve online, as sketched below.
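The pre-train-then-adapt loop can be sketched with the river library's AdaptiveRandomForestClassifier (the class linked above; newer river releases expose it as forest.ARFClassifier). Only parameters we are confident map directly onto river's API are set here, and the feature dictionary and feedback callback are illustrative assumptions rather than the original demo code:

```python
from river import ensemble

# Adaptive Random Forest, loosely mirroring the reported configuration.
arf = ensemble.AdaptiveRandomForestClassifier(
    n_models=5, max_features="log2", max_depth=2, seed=42
)

def pretrain(model, lab_dataset):
    """Warm-start on the in-lab dataset, one datapoint at a time."""
    for features, label in lab_dataset:  # features: dict of card-level features
        model.learn_one(features, label)

def ask_visitor_for_feedback(prediction):
    """Placeholder for the GUI validate/reject step described in Section 2.2."""
    answer = input(f"The model says '{prediction}'. True label (creative/descriptive)? ")
    return answer.strip()

def interactive_step(model, features):
    """Classify one card description, then adapt using the visitor's feedback."""
    prediction = model.predict_one(features)           # shown to the visitor
    true_label = ask_visitor_for_feedback(prediction)  # explicit ground truth
    model.learn_one(features, true_label)              # per-interaction online update
    return prediction
```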
2. Methods
The Maker Faire took place in Rome from Friday the 7th to Sunday the 9th of October 2022. At the expo, we presented the game as one of the scientific demonstrations visitors could interact with at our stand. The game was meant to collect data in the wild and to disseminate our research, teaching visitors about pupillometry, its relation to cognitive load, and its potential applications in robotics and other fields. 146 visitors played the game (58 on Friday, 45 on Saturday, and 43 on Sunday).

2.1. Setup
Players played the game while standing. A graphical user interface was presented in front of them on a 15" laptop. Next to the computer lay the Tobii Pro Glasses 2 eye-tracker and the deck of 80 Dixit Journey gaming cards (see Figure 1).

2.2. Procedure
The experimenter explained to the visitors that, through the Tobii Pro Glasses 2 eye-tracker, the game would read their right-eye pupil dilation, stream it in real-time, and store it in an anonymized format. If the visitors agreed to this data collection, the experimenter asked them to wear the eye-tracker and started a new match - we did not perform the eye-tracker calibration, as it is not necessary to measure pupil dilation [43]. The game showed a real-time plot of visitors' right-eye pupil diameter. For dissemination purposes, the experimenter explained how pupils change due to environmental illumination (i.e., asking visitors to look at a lamp and then watch their pupils constrict) and cognitive effort (i.e., asking them to perform math calculations); this procedure was also meant to verify the correct functioning of the device.

Then, the match started. For each match, players were asked to describe at least three cards: for the first one, they had to describe what they saw (i.e., be descriptive); for the second one, they had to be creative and deceive; from the third card onward, they could decide whether to be descriptive or creative. For each card, the game tried to classify the description as descriptive or creative, provided the classification to the visitor, and asked for honest feedback. Please note that classifications were also produced for the first two cards; however, showing them to the visitors was meaningless, since they were instructed on how to behave. This controlled procedure was meant to collect a dataset as balanced as possible while letting players challenge the system as they wished. Visitors were not limited in the duration of their descriptions nor in the number of described cards. The experimenter manually clicked a GUI button to start and stop the data collection during each card description and validated or rejected the classification based on visitors' feedback. Once visitors decided to end the match, the experimenter removed the eye-tracker and explained how the data were processed and how the classification was performed (see next section).

2.3. Data Processing
Visitors' right-eye pupil dilation was streamed from the Tobii eye-tracker at a frequency of 20 Hz. The game continuously accumulated data in a 1-second baseline queue (i.e., 20 datapoints). During the descriptions (i.e., between the start and stop GUI button clicks), the pupil data were instead stored in a separate buffer specific to that card. At the end of the description, both the baseline and the timeseries relative to the card were cleaned with a median outlier filter and a smoothing sliding window; the timeseries was baseline-normalized by subtracting the mean pupil dilation during the baseline [27], and the features (see Section 1.2) were computed; a sketch of this processing chain is given below. Both models classified the card description in parallel; however, only the Adaptive Random Forest classification was presented to the visitor. We opted for this solution to disseminate another piece of science: an incremental interactive model able to progressively improve thanks to visitors' interactions. Finally, the data - both raw and cleaned timeseries and baselines, along with the extracted features - and the real label for each description were stored for post hoc analysis.
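A compact sketch of the per-card processing chain just described; the filter window, the outlier threshold, and the feature names are our own illustrative choices, while the overall steps (median outlier filtering, smoothing, baseline subtraction, feature extraction) follow the text:

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

def clean(samples, window=5, n_mads=3):
    """Median-based outlier removal followed by a smoothing sliding window."""
    s = pd.Series(samples, dtype=float)
    med = s.median()
    mad = (s - med).abs().median() or 1e-6
    s = s.mask((s - med).abs() > n_mads * mad)               # drop outlier samples
    return s.interpolate().rolling(window, min_periods=1).mean()

def extract_features(card_pupil, baseline_pupil, fs=20.0):
    """Baseline-normalize the card timeseries and compute the card-level features."""
    card = clean(card_pupil) - clean(baseline_pupil).mean()  # baseline subtraction [27]
    t = np.arange(len(card)) / fs
    return {
        "duration": len(card) / fs,
        "pupil_mean": card.mean(),
        "pupil_min": card.min(),
        "pupil_max": card.max(),
        "pupil_skew": skew(card),
        "abs_energy": float((card ** 2).sum()),
        "pupil_slope": np.polyfit(t, card, 1)[0],            # linear trend of the dilation
    }
```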
3. Results
The 146 visitors described 532 cards - an average of 3.75 (SD=0.90) cards each. They were free to decide whether to be creative or descriptive from the third card onward; hence, the card dataset is slightly unbalanced, with 274 creative and 258 descriptive attempts. Starting from the stored features, we applied a standardized outlier detection procedure: we discarded (i) all the descriptions whose average pupil dilation, slope, or duration was more than three times the subjective mean for that feature; and (ii) all the descriptions shorter than 1 second. Also, we excluded 4 visitors who decided to leave after performing only one or two descriptions. The resulting dataset comprises 515 descriptions (269 creative and 246 descriptive).

3.1. In-Game Performance
The Random Forest (RF) classifier achieved an accuracy of 59.1%, while the Adaptive Random Forest (ARF) only achieved an accuracy of 54.4%. However, a Pearson's Chi-square test showed that the latter performance was not statistically lower (z=1.51, p=0.065). On the F1-score, instead, the RF classifier achieved 55.2%, significantly lower than the 66.3% achieved by the ARF (z=-2.70, p=0.003). With respect to their in-lab performances, both models worsened: the Chi-square test showed significantly lower accuracy for both RF (z=3.66, p<0.001) and ARF (z=4.40, p<0.001); however, while the Random Forest F1-score was significantly lower in the wild (z=2.86, p=0.002), the F1-score of the Adaptive counterpart was instead higher in the wild, although not significantly (z=-0.79, p=0.22).

To better understand such performance differences, we explored the presence of data and concept drift at the faire with respect to the lab dataset. We composed a comprehensive dataset, including data from both environments. A Shapiro-Wilk normality test showed that the data were not normally distributed, so we opted for a non-parametric analysis.

3.1.1. Data Drift Analysis
In our context, a data drift would represent a different distribution of the behavioral and pupillometry features between the two environments. We fitted a set of mixed-effects models, one for each feature (duration of the description; average, minimum, maximum, skewness, absolute energy, and slope of the pupil dilation) as dependent variable; we entered a fixed effect "environment" (two levels: lab, faire; with reference on the lab) and participants' ID as random effect. Mixed-effects models can represent the differences in distribution given by an independent fixed factor (i.e., the environment) while taking into account and grouping datapoints characterized by the same subjective random difference (i.e., the set of cards described by the same participant). Furthermore, they allow using the entire dataset even if the distributions are unbalanced (e.g., the different numbers of described cards and participants in the two environments).

Participants took slightly more time to describe the cards at the faire (B=4.54, t=2.55, p=0.012); we speculate this could be due to the more informal context and the lack of the turn-taking mechanic of the original game. Regarding the pupillometry features, participants' average (B=-0.151, t=-3.68, p<0.001), minimum (B=-0.317, t=-6.75, p<0.001) and absolute energy (B=-81.9, t=-9.24, p<0.001) of the pupil dilation were lower at the faire. Finally, the maximum (B=-0.018, t=-0.358, p=0.721), skewness (B=0.014, t=0.315, p=0.753) and slope (B=-2.11, t=-1.51, p=0.134) of the pupil dilation were not statistically different. We speculate this difference could have been caused by the higher illumination of the fair, inducing an on-average lower pupil dilation.
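Both this analysis and the concept-drift analysis in the next subsection can be reproduced with a standard mixed-effects package; a minimal sketch with statsmodels, assuming the per-card features are gathered in a dataframe with "environment" (lab/faire) and "participant" columns (the column names are ours, not from the original analysis):

```python
import statsmodels.formula.api as smf

FEATURES = ["duration", "pupil_mean", "pupil_min", "pupil_max",
            "pupil_skew", "abs_energy", "pupil_slope"]

def data_drift_models(df):
    """One mixed-effects model per feature: fixed effect 'environment'
    (reference level: lab) and a random intercept per participant."""
    results = {}
    for feat in FEATURES:
        formula = f"{feat} ~ C(environment, Treatment(reference='lab'))"
        # The concept-drift models additionally enter the 'behavior' fixed effect.
        results[feat] = smf.mixedlm(formula, data=df, groups=df["participant"]).fit()
    return results

# Example usage:
# drift = data_drift_models(df)
# print(drift["pupil_mean"].summary())  # fixed-effect estimate, t value, p value
```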
3.1.2. Concept Drift Analysis
Then, we looked for concept drifts, i.e., differences in the relationship between descriptive and creative attempts in the two environments. We fitted another set of mixed-effects models on the same features; we entered two fixed effects, "environment" (two levels: lab, faire; with reference on the lab) and "behavior" (two levels: descriptive, creative; with reference on descriptive), along with a random effect on participants' ID. Interestingly, only the maximum - lab: (B=0.194, t=6.32, p<0.001); faire: (B=0.92, t=3.61, p<0.001) - and the slope - lab: (B=12.0, t=10.9, p<0.001); faire: (B=3.6, t=3.42, p<0.001) - of the pupil dilation preserved the same pattern across the two environments: they were higher when being creative, even if the effect was stronger in the lab. On the duration, the difference between creative and descriptive, not significant in the lab (B=-0.134, t=0.121, p=0.904), was highly significant at the faire (B=5.542, t=5.7, p<0.001), with longer descriptions for creative attempts. On the contrary, the average - lab: (B=0.243, t=9.2, p<0.001); faire: (B=0.01, t=0.521, p=0.603) -, absolute energy - lab: (B=34.84, t=4.99, p<0.001); faire: (B=0.03, t=0.01, p=0.996) -, and skewness - lab: (B=-0.289, t=-9.73, p<0.001); faire: (B=0.02, t=0.69, p=0.493) - of the pupil dilation were higher when being creative in the lab, but not at the faire. Finally, the minimum pupil dilation showed a significant effect in both environments, but with opposite valence: it was higher when being creative in the lab (B=0.262, t=8.86, p<0.001), and the other way round at the faire (B=-0.06, t=-2.47, p=0.014).

3.2. Temporal Evolution and Learning
Given such different distributions and patterns, it is not surprising that the two models performed worse at the faire. The crucial factor is whether the classifiers can learn and adapt to the novel environment. To better understand the learning process of the Adaptive Random Forest, we observed the temporal evolution of the accuracy and F1-score curves.

Figure 2: (A and B) Cumulative accuracy (A) and F1-score (B) at the Maker Faire Rome 2022 for the Random Forest (blue) and Adaptive Random Forest (ARF) models. (C and D) Simulated accuracy and F1-score over 30 randomized permutations of the visitors.

Figure 2 (A-B) shows the cumulative accuracy and F1-score of the two models as visitors played the game. Please remember that the accuracy metric balances the performance on detecting both creative and descriptive attempts; the F1-score instead focuses on how well the models detect the creative class only. Both metrics of the Random Forest classifier converge to a plateau; for the Adaptive Random Forest, instead, the two metrics keep increasing, which is particularly visible for the F1-score. To verify whether the adaptability of the latter model produces this progressive increase, it is first necessary to mitigate the order effect of how visitors interacted with the game. Indeed, the order in which examples are provided could impact the learning rate and the convergence time of an online model [44]. Hence, using the faire dataset only, we simulated 30 random permutations of the visitors interacting with the game (i.e., keeping the order of the described cards unaffected). For each permutation, we generated and evaluated two new models, as sketched below.
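A sketch of the permutation procedure for the adaptive model, assuming the faire dataset is grouped per visitor as lists of (feature dict, boolean creative label) pairs; the variable names and model configuration are illustrative, and whether the simulated models were warm-started on the in-lab data is not detailed here:

```python
import random
from river import ensemble, metrics

def simulate_permutation(visitors, seed):
    """Re-play the faire dataset with visitors in a random order, keeping each
    visitor's own card order intact, and track the cumulative metrics."""
    rng = random.Random(seed)
    shuffled = rng.sample(visitors, k=len(visitors))
    arf = ensemble.AdaptiveRandomForestClassifier(n_models=5, max_depth=2, seed=seed)
    acc, f1 = metrics.Accuracy(), metrics.F1()  # labels assumed boolean: True = creative
    curve = []
    for cards in shuffled:                      # cards: [(feature_dict, is_creative), ...]
        for x, y in cards:
            y_pred = arf.predict_one(x)
            if y_pred is not None:              # no prediction before the first update
                acc.update(y, y_pred)
                f1.update(y, y_pred)
            arf.learn_one(x, y)                 # per-interaction adaptation
            curve.append((acc.get(), f1.get()))
    return curve

# Thirty randomized orderings; the average slope of each cumulative curve can then
# be compared between the adaptive and the static model with a paired t-test.
# curves = [simulate_permutation(faire_visitors, seed=s) for s in range(30)]
```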
Figure 2 (C-D) shows the simulation's cumulative accuracy and F1-score. As can be seen, the ARF model's final accuracy (M=52.9%, SD=0.02%) and F1-score (M=61%, SD=0.07%) are poorly affected by the order - the final RF metrics are unaffected, as expected. Also, the increasing and converging patterns of the ARF and RF curves are preserved for both accuracy and F1-score. To better quantify this increasing trend, we analyzed the average slope of the two metric curves, for both models, across the 30 simulated permutations. Given the normal distribution of the slope averages, we opted for a parametric test. A paired t-test showed that the slopes of both the accuracy (t=3.91, p<0.001) and F1-score (t=13.22, p<0.001) curves of the ARF were higher than the RF ones - please note that the same statistically significant difference was also present for the precision (t=8.25, p<0.001), recall (t=15.51, p<0.001), and ROC AUC (t=3.8, p<0.001). Hence, with more interacting players, the ARF model's learning and adaptation ability would make it surpass the static classifier.

4. Discussion & Conclusion
In this manuscript, we tested the performance of our trained lying (i.e., creativity) classifier in the chaotic environment of the Maker Faire Rome 2022. Grasping humans' creative process is a highly complex problem even in a closed and controlled laboratory, and all the more so when approaching realistic environments. As humans demonstrate, the key to "surviving" in the wild is learning and adaptation. Hence, we compared the performance of our static model with that of an adaptive counterpart trained on the same data and features. As expected, the Maker Faire dataset was both data and concept drifted with respect to the in-laboratory collected data. As a result, even if the two models discriminated descriptive and creative card descriptions better than chance, their performance decreased with respect to the in-lab test performance. However, by analyzing the average slope of the accuracy and F1-score metrics, we showed how the Adaptive Random Forest was learning and adapting to the novel environment. It already surpassed the static counterpart on the F1-score metric (i.e., the goodness in recognizing the creative attempts only); moreover, the positive slopes of the curves suggest it would also improve the accuracy and the other metrics.

In the manuscript, we omitted the case where the static Random Forest is retrained with the novel examples (e.g., at the end of each day). Even if such a solution is reasonable from a pure computer science point of view, it might not be the best option when thinking about an intelligent agent (e.g., a humanoid robot) embedding the lie detection system in an actual application. Indeed, the speed with which a model adapts and improves is crucial in human-robot and human-computer interaction. Aiming to maximize the quality of each interactive session, a daily-retrained Random Forest would not compete with a per-interaction adaptive model.

Both models are still limited in the timing with which they provide classifications: before classifying players' behavior, they have to observe the full card description. It could be more effective - and interesting - to recognize creativity and deception from sub-segments of the interactions, i.e., by incrementally gaining confidence until one of the classes is grounded. Focusing on the adaptive interactive classifier, the model still depends on players' explicit feedback (i.e., validating or rejecting the classification). In a realistic environment, it would not always be possible to access an explicit ground truth, especially in the lie detection field.
For this purpose, we speculate that subjective adaptation and reliance on past experiences could enable the system to recognize implicit signals validating the classifications.

Generally speaking, our system is still limited by relying on pupillometry only. Even if the literature shows that pupillometry is affected by lying [29] and creativity [13], pupil fluctuations are not one-to-one bound to such behaviors. They rather reflect variations in cognitive effort and emotional arousal. To be reliable in evaluating humans' creativity, they must be considered with respect to a specific context or task (i.e., Task Evoked Pupillary Responses [28]). Hence, intelligent systems aiming to grasp humans' reactions based on pupillometry, or other physiological effects, must be able to model the context in which the interaction takes place and relate humans' reactions to it. The second general limitation is the usage of gaming cards as a creative medium. We opted for Dixit gaming cards because they are designed to stimulate creativity and divergent thinking. However, aiming for realistic interactions, the medium should be generalized (e.g., using generic pictures or photographs) or even removed.

To improve our system, we are pursuing this research in four parallel directions: (i) classifying humans' behavior in near real-time by processing sequential chunks of the card descriptions; (ii) developing subjective models able to learn how a specific human partner lies and reacts when a classification is provided, using such knowledge both to improve the classification and to recognize implicit feedback; (iii) including multiple modalities in the classification (i.e., posture and verbal prosody) to better understand the context in which the interaction happens and to improve the robustness of the model; and (iv) looking for innovative solutions [32, 33, 34] to measure pupillometry from standard RGB cameras.

Besides the applications in the lie detection and creativity understanding fields, our research is based on evaluating humans' cognitive effort when facing divergent and stressful tasks. We speculate that our findings would help in diverse fields like security, teaching, and caregiving by enabling intelligent virtual and robotic agents to understand humans' behavior and better support us.

Acknowledgments
This work has been supported by a Starting Grant from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, GA No 804388, wHiSPER.

References
[1] M. A. Runco, G. J. Jaeger, The standard definition of creativity, Creativity Research Journal 24 (2012) 92–96. doi:10.1080/10400419.2012.650092.
[2] A. J. Cropley, The dark side of creativity: What is it?, 2010. doi:10.1017/CBO9780511761225.001.
[3] J. J. Walczyk, M. A. Runco, S. M. Tripp, C. E. Smith, The creativity of lying: Divergent thinking and ideational correlates of the resolution of social dilemmas, Creativity Research Journal 20 (2008) 328–342. doi:10.1080/10400410802355152.
[4] N. Hao, M. Tang, J. Yang, Q. Wang, M. A. Runco, A new tool to measure malevolent creativity: The malevolent creativity behavior scale, Frontiers in Psychology 7 (2016) 1–7. doi:10.3389/fpsyg.2016.00682.
[5] M. L. Beaussart, C. J. Andrews, J. C. Kaufman, Creative liars: The relationship between creativity and integrity, Thinking Skills and Creativity 9 (2013) 129–134. URL: http://dx.doi.org/10.1016/j.tsc.2012.10.003. doi:10.1016/j.tsc.2012.10.003.
[6] B. M. DePaulo, S. E. Kirkendol, D. A. Kashy, M. M. Wyer, J. A. Epstein, Lying in everyday life, Journal of Personality and Social Psychology 70 (1996) 979–995. doi:10.1037/0022-3514.70.5.979.
[7] B. M. DePaulo, B. E. Malone, J. J. Lindsay, L. Muhlenbruck, K. Charlton, H. Cooper, Cues to deception, 2003. URL: http://doi.apa.org/getdoi.cfm?doi=10.1037/0033-2909.129.1.74. doi:10.1037/0033-2909.129.1.74.
[8] O. FeldmanHall, P. Glimcher, A. L. Baker, E. A. Phelps, Emotion and decision-making under uncertainty: Physiological arousal predicts increased gambling during ambiguity but not risk, Journal of Experimental Psychology: General 145 (2016) 1255–1262. doi:10.1037/xge0000205.
[9] C. Hadnagy, Social engineering: The art of human hacking, The Art of Human Hacking 3 (2010) 408. doi:10.1504/ijipsi.2018.10013213.
[10] C. Tosone, Living everyday lies: The experience of self, Clinical Social Work Journal 34 (2006) 335–348. doi:10.1007/s10615-005-0035-z.
[11] C. F. Bond, B. M. DePaulo, Accuracy of deception judgments, Personality and Social Psychology Review 10 (2006) 214–234. doi:10.1207/s15327957pspr1003_2.
[12] C. R. Honts, D. C. Raskin, J. C. Kircher, Mental and physical countermeasures reduce the accuracy of polygraph tests, Journal of Applied Psychology 79 (1994) 252–259. URL: http://www.ncbi.nlm.nih.gov/pubmed/8206815. doi:10.1037/0021-9010.79.2.252.
[13] M. M. Bradley, L. Miccoli, M. A. Escrig, P. J. Lang, The pupil as a measure of emotional arousal and autonomic activation, Psychophysiology 45 (2008) 602–607. doi:10.1111/j.1469-8986.2008.00654.x.
[14] M. Gamer, Detecting of deception and concealed information using neuroimaging techniques, 2011, pp. 90–113. URL: https://www.cambridge.org/core/product/identifier/CBO9780511975196A018/type/book_part. doi:10.1017/CBO9780511975196.006.
[15] B. A. Rajoub, R. Zwiggelaar, Thermal facial analysis for deception detection, IEEE Transactions on Information Forensics and Security 9 (2014) 1015–1023. URL: http://ieeexplore.ieee.org/document/6797879/. doi:10.1109/TIFS.2014.2317309.
[16] C. Y. Ma, M. H. Chen, Z. Kira, G. AlRegib, TS-LSTM and temporal-inception: Exploiting spatiotemporal dynamics for activity recognition, Signal Processing: Image Communication 71 (2019) 76–87. URL: http://arxiv.org/abs/1703.10667. doi:10.1016/j.image.2018.09.003.
[17] V. Karpova, P. Popenova, N. Glebko, V. Lyashenko, O. Perepelkina, "Was it you who stole 500 rubles?" - the multimodal deception detection, 2020, pp. 112–119. doi:10.1145/3395035.3425638.
[18] X. L. Chen, S. I. Levitan, M. Levine, M. Mandic, J. Hirschberg, Acoustic-prosodic and lexical cues to deception and trust: Deciphering how people detect lies, Transactions of the Association for Computational Linguistics 8 (2020) 199–214. doi:10.1162/tacl_a_00311.
[19] A. Gaggioli, Beyond the truth machine: Emerging technologies for lie detection, Cyberpsychology, Behavior, and Social Networking 21 (2018) 144–144. URL: http://www.liebertpub.com/doi/10.1089/cyber.2018.29102.csi. doi:10.1089/cyber.2018.29102.csi.
[20] D. Pasquali, J. Gonzalez-Billandon, A. M. Aroyo, G. Sandini, A. Sciutti, F. Rea, Detecting lies is a child (robot)'s play: Gaze-based lie detection in HRI, International Journal of Social Robotics (2021) 1–16. URL: https://link.springer.com/article/10.1007/s12369-021-00822-5. doi:10.1007/s12369-021-00822-5.
[21] S. Vinanzi, M. Patacchiola, A. Chella, A. Cangelosi, Would a robot trust you? Developmental robotics model of trust and theory of mind, CEUR Workshop Proceedings 2418 (2019) 74. doi:10.1098/rstb.2018.0032.
[22] M. Patacchiola, A. Cangelosi, A developmental cognitive architecture for trust and theory of mind in humanoid robots, IEEE Transactions on Cybernetics (2020) 1–13. doi:10.1109/TCYB.2020.3002892.
[23] G. Metta, G. Sandini, D. Vernon, L. Natale, F. Nori, The iCub humanoid robot: An open platform for research in embodied cognition, ACM Press, 2008, pp. 50–56. URL: http://portal.acm.org/citation.cfm?doid=1774674.1774683. doi:10.1145/1774674.1774683.
[24] J. G. May, R. S. Kennedy, M. C. Williams, W. P. Dunlap, J. R. Brannan, Eye movement indices of mental workload, Acta Psychologica 75 (1990) 75–89. doi:10.1016/0001-6918(90)90067-P.
[25] M. Nakayama, Y. Shimizu, Frequency analysis of task evoked pupillary response and eye-movement, ACM Press, 2004, pp. 71–76. URL: http://portal.acm.org/citation.cfm?doid=968363.968381. doi:10.1145/968363.968381.
[26] B. C. Goldwater, Psychological significance of pupillary movements, Psychological Bulletin 77 (1972) 340–355. doi:10.1037/h0032456.
[27] S. Mathôt, J. Fabius, E. V. Heusden, S. V. der Stigchel, Safe and sensible preprocessing and baseline correction of pupil-size data, Behavior Research Methods 50 (2018) 94–106. doi:10.3758/s13428-017-1007-2.
[28] J. Beatty, B. Lucero-Wagoner, The pupillary system, Handbook of Psychophysiology 2 (2000). URL: https://psycnet.apa.org/record/2000-03927-005.
[29] D. P. Dionisio, E. Granholm, W. A. Hillix, W. F. Perrine, Differentiation of deception using pupillary responses as an index of cognitive processing, Psychophysiology 38 (2001) 205–211. URL: http://www.ncbi.nlm.nih.gov/pubmed/11347866. doi:10.1017/S0048577201990717.
[30] A. Szulewski, N. Roth, D. Howes, The use of task-evoked pupillary response as an objective measure of cognitive load in novices and trained physicians: A new tool for the assessment of expertise, Academic Medicine 90 (2015) 981–987. doi:10.1097/ACM.0000000000000677.
[31] M. I. Ahmad, J. Bernotat, K. Lohan, F. Eyssel, Trust and cognitive load during human-robot interaction, 2019. URL: https://arxiv.org/abs/1909.05160v1.
[32] C. Wangwiwattana, X. Ding, E. C. Larson, PupilNet, measuring task evoked pupillary response using commodity RGB tablet cameras, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1 (2018) 1–26. URL: http://dl.acm.org/citation.cfm?doid=3178157.3161164. doi:10.1145/3161164.
[33] S. Rafiqi, C. Wangwiwattana, J. Kim, E. Fernandez, S. Nair, E. C. Larson, PupilWare, ACM, 2015, pp. 1–8. URL: https://dl.acm.org/doi/10.1145/2769493.2769506. doi:10.1145/2769493.2769506.
[34] S. Eivazi, T. Santini, A. Keshavarzi, T. Kübler, A. Mazzei, Improving real-time CNN-based pupil detection through domain-specific data augmentation, Eye Tracking Research and Applications Symposium (ETRA) (2019). doi:10.1145/3314111.3319914.
[35] J. Gonzalez-Billandon, A. M. Aroyo, A. Tonelli, D. Pasquali, A. Sciutti, M. Gori, G. Sandini, F. Rea, Can a robot catch you lying? A machine learning system to detect lies during interactions, Frontiers in Robotics and AI 6 (2019). doi:10.3389/frobt.2019.00064.
[36] A. Aroyo, J. Gonzalez-Billandon, A. Tonelli, A. Sciutti, M. Gori, G. Sandini, F. Rea, Can a humanoid robot spot a liar?, IEEE, 2018, pp. 1045–1052. URL: https://ieeexplore.ieee.org/document/8624992/. doi:10.1109/HUMANOIDS.2018.8624992.
[37] D. O. Iacob, A. Tapus, First attempts in deception detection in HRI by using thermal and RGB-D cameras, RO-MAN 2018 - 27th IEEE International Symposium on Robot and Human Interactive Communication (2018) 652–658. doi:10.1109/ROMAN.2018.8525573.
[38] D. O. Iacob, A. Tapus, Detecting deception in HRI using minimally-invasive and noninvasive techniques, 2019 28th IEEE International Conference on Robot and Human Interactive Communication, RO-MAN 2019 (2019) 1–7. doi:10.1109/RO-MAN46459.2019.8956384.
[39] D. Pasquali, A. M. Aroyo, J. Gonzalez-Billandon, F. Rea, G. Sandini, A. Sciutti, Your eyes never lie: A robot magician can tell if you are lying, 2020. doi:10.1145/3371382.3378253.
[40] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357. doi:10.1613/jair.953.
[41] J. Montiel, M. Halford, S. M. Mastelini, G. Bolmier, R. Sourty, R. Vaysse, A. Zouitine, H. M. Gomes, J. Read, T. Abdessalem, A. Bifet, River: Machine learning for streaming data in Python (2020). URL: http://arxiv.org/abs/2012.04740.
[42] H. M. Gomes, A. Bifet, J. Read, J. P. Barddal, F. Enembreck, B. Pfahringer, G. Holmes, T. Abdessalem, Adaptive random forests for evolving data stream classification, Machine Learning 106 (2017) 1469–1495. doi:10.1007/s10994-017-5642-8.
[43] Tobii Pro, Quick tech webinar - secrets of the pupil. URL: https://www.youtube.com/watch?v=I3T9Ak2F2bc&feature=emb_title.
[44] A. Cornuéjols, Getting order independence in incremental learning.