Emotional Mario Task at MediaEval 2021
Mathias Lux¹, Michael Riegler², Steven Hicks², Duc-Tien Dang-Nguyen³,⁴, Kristine Jorgensen³,
Vajira Thambawita², Pål Halvorsen²
¹ Alpen-Adria Universität Klagenfurt, Austria
² SimulaMet, Norway
³ University of Bergen, Norway
⁴ Kristiania University College, Norway

mlux@itec.aau.at, michael@simula.no, steven@simula.no, ductien.dangnguyen@uib.no,
kristine.jorgensen@uib.no, vajira@simula.no, paalh@simula.no

ABSTRACT
Video games are often understood as engines of experience, and the interaction with
the game lets players consume carefully constructed experiences. While it is generally
agreed upon that a good experience makes a good game, measuring or observing the impact
of the gameplay on the players’ experience is still an open problem. In the 2021
Emotional Mario task, we ask researchers to investigate the gameplay of ten study
participants on one of the most iconic classic video games: Super Mario Bros. We provide
data to learn from, including heart rate, skin conductivity, videos of the players’
faces synchronized to the gameplay, the gameplay itself, and player demographics
including their scores and times spent in the game. Participants of the task are asked
to predict gameplay events based on the biometric and facial data of the players.

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, December 13-15 2021, Online

Figure 1: Collage of the first frames of all 32 Super Mario Bros. levels organized in
eight worlds with four levels each.

1    INTRODUCTION
With the rise of deep learning, several large leaps in research have been achieved in
recent years, such as human-level image recognition, text classification, and even
content creation. Games and deep learning also have a relatively long history together,
specifically in reinforcement learning. However, video games still pose a lot of
challenges. Games are understood as engines of experience [9], and as such, they need
to evoke human emotions. While emotion recognition has come a long way over the last
decade [7], the connection between emotions and video games is still an open and
interesting research question. As games are designed to evoke emotions [9], we
hypothesize that emotions in the player are reflected in the visuals of the video game.
Simple examples are when players are happy after having mastered a particularly
complicated challenge, when they are shocked by a jump scare scene in a horror game, or
when they are excited after unlocking a new resource. Questionnaires can measure these
things after playing [1], but in the Emotional Mario task, we want to interconnect
emotions and gameplay based on data instead of asking the players.
   For the Emotional Mario challenge, we focus on the iconic Super Mario Bros. video
game and provide a multimodal dataset based on a Super Mario Bros. implementation for
OpenAI Gym [2]. For a population of ten players, the dataset contains their game input,
demographics, biomedical sensor readings from a medical-grade device, and videos of
their faces while playing the game.

2    TASK DESCRIPTION
The goal in the Emotional Mario task is to relate data about the players, e.g., their
heart rate, skin conductivity, or their facial expressions, to the gameplay and events
like Mario losing a life, finishing a level, or gaining a power-up by consuming a
mushroom. Emotional Mario is structured into two subtasks:
      • In the first subtask, we asked participants to identify events of high
        significance in the gameplay by just analyzing the facial video and the
        biometric data. Such significant events include the end of a level, a power-up
        or extra life for Mario, or Mario’s death.
      • For the second subtask, which was optional, we asked participants to create a
        video summary of the best moments of the play. This can include gameplay
        scenes, facial video, data visualization, and anything else that supports such
        a summary.

3    DATASET
The task provides a dataset of videos and sensor readings of people playing Super Mario
Bros. [8]. In total, a population of ten people was selected for data gathering, ranging
from gaming veterans to novice players, with an even split between male and female
participants. Each participant provided written consent, allowing for their video,
gameplay data, and sensor data to be shared openly for research and teaching purposes
under a Creative Commons Attribution-NonCommercial 4.0 International License.¹

¹ http://creativecommons.org/licenses/by-nc/4.0/, accessed 2020-11-04


The dataset can be accessed via https://datasets.simula.no/toadstool or
https://osf.io/qrkcf/.
   For each participant, a range of multimodal data was recorded and included in the
dataset:
      • a video file recording the participant’s face with a 1.3-MP webcam at 30 fps
        and 640x480 pixels,
      • the controller input performed on each frame of the game utilizing a wired USB
        controller from retro-bit, which is modeled after the original controller for
        the Nintendo Entertainment System,
      • sensor data collected from an Empatica E4 wristband [4] including heart rate,
        temperature, skin conductivity, and accelerometer data, and
      • video game action files, which are scripts to generate the video game frames.
   Besides the actual data, the provided dataset includes documentation and a process
description. Additional material includes the original questionnaire presented to the
participants and their answers, the consent form signed by the participants, the
license, and a README.txt file detailing the use of the dataset. A detailed description
of the dataset is given in [8].
   In addition to the dataset, we provide (i) ground truth data for the events in the
game for 7 out of 10 participants (the remaining three are used for the test set), and
(ii) results from an automated facial expression recognition package [3] including a
confidence value for the basic emotions anger, disgust, fear, happiness, sadness, and
surprise, as well as a neutral expression along with a bounding box for the detected
face. Examples are shown in Fig. 2.

Figure 2: Sample output from the facial expression recognition algorithm with bounding
box and prediction. Videos taken from the Toadstool data set [8].

4    EVALUATION
The evaluation of the task is two-fold. For the first subtask, we collected the
participants’ output on finding events for the three held-out test participants. We
investigated the precision, recall, and F1 score of the events. We ran four different
types of evaluations. We investigated whether the submissions found the events within
+/- one second of the actual events and did the same for +/- five seconds. Two more
evaluations were done focusing on the time of the event, discarding the type. A random
baseline was created by simulating runs with randomized events. The random baseline is
biased by the knowledge of how many events are expected and chooses the number of
events randomly in the range of the actual number of events +/- 50%.² Table 1 and
Table 2 give an overview of the best results of each group as well as the averaged
random baseline.

Table 1: Best evaluation results per group per evaluation and the averaged random
baseline for matching event timestamps in the range of +/- 5 seconds

      Team                  Precision    Recall    F1 score
      DCU-Gamestreamer      0.0021       0.8903    0.0041
      GSE-AAU               0.0242       0.0812    0.0373
      Random                0.2847       0.2847    0.2947

Table 2: Best evaluation results per group per evaluation and the averaged random
baseline for matching events in the range of +/- 5 seconds

      Team                  Precision    Recall    F1 score
      DCU-Gamestreamer      0.0014       0.5709    0.0028
      GSE-AAU               0.0112       0.0849    0.0197
      Random                0.0667       0.0667    0.0667

   We expected few submissions for the second subtask and planned a qualitative,
heuristic evaluation. An expert panel of professionals and researchers from the fields
of game development, game studies, e-sports, and media sciences was to review the
submissions and judge them on:
     (1) Informative value (i.e., is it a good summary of the gameplay),
     (2) Accuracy (i.e., does it reflect the emotional ups and downs and the skill of
         the play), and
     (3) Innovation (i.e., surprisingly new approach, non-linearity of the story,
         creative use of cuts, etc.)
Unfortunately, we did not receive submissions for the second subtask.

5    DISCUSSION AND OUTLOOK
While the MediaEval Emotional Mario task is the spiritual successor of the GameStory
task [5, 6], the goals are different. The work on Counter-Strike: Global Offensive and
the analysis of the game streaming and e-sports phenomena have shown the substantial
impact games have on culture and society. With the availability of biometric sensors
and deep learning for data analysis, we re-focus on the interrelation of the game and
the player’s experience.
   With the Emotional Mario task, we hope to outline the direction of research where
player-game interaction can be extended, and games as engines of experience can be
understood. Games are not only a playground for people. They are also a vast resource
for research and future developments.

ACKNOWLEDGMENTS
We’d like to thank Dr. Andreas Leibetseder for his support.

² Evaluation scripts and creation of the random baseline can be found on
https://github.com/dermotte/EmotionalMarioEvaluation
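
The tolerance-based matching described in Section 4 can be illustrated with a short Python sketch. It is not the task's evaluation script (that is available at the repository named in the footnote) and may differ from it in detail. The `Event` type, the event labels, the greedy one-to-one matching rule, and the uniform sampling of baseline timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class Event:
    time: float  # seconds since the start of a play session (assumed unit)
    kind: str    # hypothetical label, e.g. "death", "power_up", "level_end"


def count_matches(predicted, truth, tolerance, use_kind=True):
    """Pair each ground-truth event with at most one unused prediction whose
    timestamp lies within +/- tolerance seconds (greedy, in time order).
    With use_kind=False the event type is ignored."""
    unused = sorted(predicted, key=lambda e: e.time)
    matches = 0
    for t in sorted(truth, key=lambda e: e.time):
        for i, p in enumerate(unused):
            close = abs(p.time - t.time) <= tolerance
            if close and (not use_kind or p.kind == t.kind):
                del unused[i]  # each prediction may be used only once
                matches += 1
                break
    return matches


def precision_recall_f1(predicted, truth, tolerance, use_kind=True):
    """Return (precision, recall, F1) for one run against the ground truth."""
    tp = count_matches(predicted, truth, tolerance, use_kind)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def random_baseline(truth, duration, kinds, rng):
    """Simulate one random run whose event count lies within +/- 50% of the
    true event count. Uniform timestamps and labels are an assumption."""
    n = len(truth)
    count = rng.randint(n - n // 2, n + n // 2)
    return [Event(rng.uniform(0.0, duration), rng.choice(kinds))
            for _ in range(count)]


if __name__ == "__main__":
    truth = [Event(10.0, "death"), Event(42.0, "power_up"), Event(90.0, "level_end")]
    predicted = [Event(12.5, "death"), Event(47.5, "power_up"),
                 Event(60.0, "death"), Event(88.0, "power_up")]
    for tolerance in (1.0, 5.0):
        for use_kind in (True, False):
            scores = precision_recall_f1(predicted, truth, tolerance, use_kind)
            print(tolerance, use_kind, scores)
    run = random_baseline(truth, 120.0, ["death", "power_up", "level_end"],
                          random.Random(0))
    print(len(run), precision_recall_f1(run, truth, 5.0))
```

Calling `precision_recall_f1` with tolerances of one and five seconds, each with and without `use_kind`, corresponds to the four evaluation variants. Averaging the scores over many `random_baseline` runs would give a baseline in the spirit of the one reported in the tables.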


REFERENCES
[1] Vero Vanden Abeele, Katta Spiel, Lennart Nacke, Daniel Johnson, and
    Kathrin Gerling. 2020. Development and validation of the player
    experience inventory: A scale to measure player experiences at the level
    of functional and psychosocial consequences. International Journal of
    Human-Computer Studies 135 (2020), 102370.
[2] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider,
    John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym.
    arXiv preprint arXiv:1606.01540 (2016).
[3] Justin Shenk et al. 2021. Facial Expression Recognition with a deep
    neural network as a PyPI package. (2021).
    https://github.com/justinshenk/fer
[4] Maurizio Garbarino, Matteo Lai, Dan Bender, Rosalind W Picard, and
    Simone Tognetti. 2014. Empatica E3—A wearable wireless multi-sensor
    device for real-time computerized biofeedback and data acquisition.
    In Proceedings of the 4th International Conference on Wireless Mobile
    Communication and Healthcare-Transforming Healthcare Through
    Innovations in Mobile and Wireless Technologies (MOBIHEALTH). IEEE,
    39–42.
[5] Mathias Lux, Michael Riegler, Duc-Tien Dang-Nguyen, Marcus Larson,
    Martin Potthast, and Pål Halvorsen. 2018. GameStory Task at
    MediaEval 2018. In Proceedings of MediaEval.
[6] Mathias Lux, Michael Riegler, Duc-Tien Dang-Nguyen, Johanna Pirker,
    Martin Potthast, and Pål Halvorsen. 2019. GameStory Task at
    MediaEval 2019. In Proceedings of MediaEval.
[7] Anvita Saxena, Ashish Khanna, and Deepak Gupta. 2020. Emotion
    recognition and detection methods: A comprehensive survey. Journal
    of Artificial Intelligence and Systems 2, 1 (2020), 53–79.
[8] Henrik Svoren, Vajira Thambawita, Pål Halvorsen, Petter Jakobsen,
    Enrique Garcia-Ceja, Farzan Majeed Noori, Hugo L Hammer, Mathias
    Lux, Michael Alexander Riegler, and Steven Alexander Hicks. 2020.
    Toadstool: a dataset for training emotional intelligent machines
    playing Super Mario Bros. In Proceedings of the ACM Multimedia Systems
    Conference (MMSYS). 309–314.
[9] Tynan Sylvester. 2013. Designing games: A guide to engineering
    experiences. O’Reilly Media, Inc.