Fall Detection using NAO Robot Pose Estimation in RoboCup SPL Matches

Cristian Zampino1, Flavio Biancospino1, Michele Brienza1, Francesco Laus1, Gianluca Di Stefano1, Rocchina Romano1, Andrea Pennisi1, Vincenzo Suriani2 and Domenico Daniele Bloisi1

1 Department of Mathematics, Computer Science, and Economics, University of Basilicata, Potenza, Italy
2 Department of Computer, Control, and Management Engineering, Sapienza University of Rome, Rome, Italy

Abstract
RoboCup is an international robotics initiative whose aim is to promote robotics and AI research. RoboCup's long-term goal is to create, by 2050, a team of fully autonomous humanoid robots capable of winning a soccer game against the human world champion team, in compliance with the official rules of FIFA. In this paper, we describe a two-step method for action recognition. In the first step, we extract the pose of the robots using a pose detector trained on a novel dataset for pose estimation, called UNIBAS NAO Pose Dataset, which is a contribution of this work. In the second step, a Spatial-Temporal Graph Convolutional Network is used for modeling the gameplay, with particular regard to fall-down detection. Experimental results show the effectiveness of our approach in detecting falls for humanoid robots.

Keywords
Fall detection, NAO robot pose estimation, Robot soccer, Sports analytics

1. Introduction

Since its foundation in 1996, one of the objectives of RoboCup has been to push the boundaries of research by offering high-level challenges. In 2022, the RoboCup Standard Platform League (in which NAO robots play soccer 5 vs 5) proposed an Open Research Challenge [1] with the goal of creating an autonomous system for generating game statistics from the videos of the matches (see Fig. 1). The motivation for the Open Research Challenge lies in the need to augment the data available from the Game Controller software, which has proven to be insufficient for extracting consistent statistics from a match [2]. The Open Research Challenge has two main goals:
1. A short-term goal, with the purpose of calculating the extrinsic camera parameters (camera matrix) from the camera feed and tracking the ball and the robots on the field;
2. A long-term goal, aiming at creating game statistics (including time under control, successful and unsuccessful shots on goal, passes, etc.) based on the located objects and positions.

AIRO 2022, the 9th Italian Workshop on Artificial Intelligence and Robotics, held in conjunction with AI*IA 2022, the 21st International Conference of the Italian Association for Artificial Intelligence
Email: andrea.pennisi@unibas.it (A. Pennisi); suriani@diag.uniroma1.it (V. Suriani); domenico.bloisi@unibas.it (D. D. Bloisi)
ORCID: 0000-0002-9081-0765 (A. Pennisi); 0000-0003-1199-8358 (V. Suriani); 0000-0003-0339-8651 (D. D. Bloisi)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

Figure 1: Example of a RoboCup SPL match recorded with a single fixed fish-eye camera. Image from www.youtube.com/RoboCupSPL

UNIBAS Wolves and SPQR Team participated in the Open Research Challenge presenting MARIO, an end-to-end architecture for computing visual statistics in RoboCup SPL [3]. One of the features of MARIO is its modular architecture, which allows the user to customize the system for extracting statistics.
In this paper, we focus on one of MARIO's modules, the one used for detecting the robot fall-down action. The novelty of this work consists in combining a NAO robot pose estimation method with a graph convolutional neural network-based approach for recognizing falling robots. A further contribution is the development of the UNIBAS NAO Pose Dataset (available at https://drive.google.com/drive/folders/1wY9Xsz3O_gYc4BbGb4p_gALotynjcH-E?usp=sharing), containing labeled images and videos of NAO robot poses.

The remainder of the paper is organized as follows. Section 2 contains a description of existing methods for extracting statistics from soccer matches. Section 3 presents the method for detecting the robot fall-down action, including a description of the dataset used for training. Quantitative experimental results are shown in Section 4. Finally, conclusions are drawn in Section 5.

2. Related Work

The analysis of soccer match videos aims at:
1. Investigating the opponent's strategies;
2. Examining the team's collective performance;
3. Calculating useful statistics designed to validate game schemes.

The goal of the analysis is to collect data to improve the quality of play so as to increase the chance of winning. The basis of any video analysis of soccer matches is the identification of the athletes; by employing increasingly advanced technologies, it has become possible to also identify the type of motion and to detect interactions with opponents.

Video analysis approaches can be grouped into two main categories. The first group includes commercial systems that are semi-automatic, meaning that the labeling of players is entered manually through user interaction. Semi-automatic commercial systems are used, for example, by TV broadcasters or professional teams to observe game situations and actions that occurred during matches. The second group is made up of commercial and academic systems that are designed to eliminate user interaction.

Semi-automatic systems include Piero [4] and Viz Libero [5], which are capable of computing many different statistics (distances, trajectories, heatmaps, free spaces, interaction spaces, etc.). Among the fully-automatic systems, the one proposed by Stein et al. [6] performs the labeling and tracking of players automatically. The use of supervised algorithms is a common characteristic of fully-automatic approaches. For training such systems, it is crucial to collect a large dataset containing properly annotated samples. For example, SoccerNet [7] is a large-scale dataset for soccer video understanding. It can be used to perform action recognition by identifying different categories of events, e.g., goals, substitutions, and yellow and red cards. An extension of SoccerNet is SoccerNet-v2 [8], which includes 17 categories of actions, with an annotated action on average every twenty-five seconds (an improvement when compared to the 8.7 actions per hour in SoccerNet). SoccerDB [9] is a soccer database similar to SoccerNet, but enhanced by the use of bounding boxes for individual player detection.

In this work, we present MARIO [3], a fully-automatic system specifically designed for analysing NAO robot soccer matches. MARIO ranked first, ex-aequo with the B-Human Team's system, in the Open Research Challenge at RoboCup 2022. Robot and ball tracking in MARIO are done automatically, using a combination of traditional [10] and deep learning-based [11] computer vision methods.
The game analysis extracts trajectories, passes, and heatmaps, presented through graphs and tables containing both traditional statistics and more advanced ones, such as falls and fouls made by the robots. In particular, we describe two supervised algorithms included in MARIO. The first is devoted to the calculation of the poses of the NAO robots, while the second aims at detecting fall-down events.

3. Proposed Method

The fall-down detection scheme includes:
1. A pose estimation module;
2. A graph convolutional neural network for detecting the movement.

Before describing our approach, it is convenient to introduce our dataset for NAO robot pose estimation, called UNIBAS NAO Pose Dataset [12].

3.1. UNIBAS NAO Pose Dataset

NAOs are humanoid robots; however, there are some differences between the NAO body structure and the human one. Thus, we have created a specific dataset for detecting the pose of a NAO robot.

Figure 2: Skeleton key points used for describing the pose of a NAO robot.

In particular, we collected video frames from the RoboCup Standard Platform League (SPL) teams. Using the COCO Annotator tool [13], we labeled 451 frames containing about 3,000 NAO robot instances in the well-known COCO format. All annotations share the same data structure: the pose is represented by up to 18 key points describing ears, eyes, nose, neck, shoulders, elbows, wrists, hips, knees, and ankles (see Fig. 2), and the annotations are stored in a JSON structure. The UNIBAS NAO Pose Dataset is publicly available [12].

3.2. NAO Pose Estimation

To detect the pose of the NAOs, we use a slightly optimized version of OpenPose, called Lightweight OpenPose [14]. In this version, the VGG feature extractor is replaced with a MobileNetV1 network, and the model is optimized to reach real-time performance. In particular, all the backbone layers are kept, and dilated convolutions are used to preserve the spatial resolution while reusing the backbone weights. To produce a new estimation of keypoint heatmaps and Part Affinity Fields (PAFs), the refinement stage takes features from the backbone concatenated with the previous estimation of the keypoint heatmaps and PAFs. In order to share the computation between heatmaps and PAFs, a single prediction branch is used in the initial and refinement stages. All the layers are shared except for the last two, which directly produce keypoint heatmaps and PAFs. To capture long-range spatial dependencies, each convolution with a 7×7 kernel is replaced by a convolutional block with the same receptive field, made of three consecutive convolutions with 1×1, 3×3, and 3×3 kernels. To preserve the original receptive field, the last convolution has a dilation parameter equal to 2, and a residual connection is added to speed up the network. The resulting block has ∼2.5 times lower computational complexity than a single 7×7 convolution.
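As a reference, the following PyTorch snippet is a minimal sketch of such a block (1×1, 3×3, and dilated 3×3 convolutions wrapped by a residual connection). The class name, the channel width used in the example, and the ReLU activations are illustrative assumptions and do not reproduce the exact layer configuration of [14].

import torch
import torch.nn as nn


class DilatedConvBlock(nn.Module):
    """Sketch of a block replacing a single 7x7 convolution:
    1x1 -> 3x3 -> 3x3 (dilation=2) keeps a 7x7 receptive field."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # dilation=2 (with padding=2) extends the receptive field of the
        # stack to 7x7 while preserving the spatial resolution
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=2, dilation=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        out = self.conv3(out)
        # residual connection around the whole block
        return self.relu(out + x)


# Example: a 128-channel feature map keeps its shape through the block
features = torch.randn(1, 128, 46, 62)
assert DilatedConvBlock(128)(features).shape == features.shape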
Figure 3: ST-GCN architecture diagram.

Figure 4: Three examples of skeletons. Left: A rejected skeleton with 5 missing key points. Center: An accepted skeleton with no missing key points. Right: An accepted skeleton with 2 missing key points.

3.3. Action Detection

The detection of the fall-down action is carried out by using a Spatial-Temporal Graph Convolutional Network (ST-GCN). The ST-GCN model includes 9 layers of spatial-temporal graph convolution operators, namely ST-GCN units (see Fig. 3). The output size is 64 for the first three layers, 128 for the following three layers, and 256 for the last three layers. A dropout layer with a probability of 0.5 is placed after each ST-GCN unit to prevent overfitting. The strides of the fourth and seventh temporal convolution layers are set to 2, acting as pooling layers. After the 9 ST-GCN units, a global pooling is performed on the resulting tensor to obtain a 256-dimensional feature vector for each sequence. Then, this feature vector is fed to a softmax classifier. Stochastic gradient descent with a learning rate of 0.01 is used as the optimization method, and the learning rate decays by a factor of 0.1 every 10 epochs.

ST-GCN takes as input the output produced by the pose estimator. Since the trained ST-GCN requires 14 skeleton key points, the 18 key points coming from the NAO pose estimator are reduced by removing the key points representing the ears and eyes. The input for the ST-GCN model is a sequence of 30 skeletons. It is worth noting that a skeleton is not included in the sequence if the number of key points returned by Lightweight OpenPose does not reach 80% of the expected total (see Fig. 4).

Figure 5: Pose estimation for multiple humanoid NAO robots in different RoboCup SPL matches.

The ST-GCN model outputs the action class with the related probability score. If the probability score is greater than 50%, the action class produced by the model is considered valid; otherwise, it is rejected. For training the model, we used the Le2i dataset proposed by Charfi et al. [15], which includes 192 video sequences containing human fall actions.

4. Experimental Results

The performance of the proposed approach has been measured by considering the pose estimation module and the fall detection system separately.

4.1. Pose Estimation Results

Fig. 5 shows the pose estimation results on four sample frames extracted from different RoboCup matches. A video showing more pose estimation results is available at https://youtu.be/QhtPp6-n3hM. In order to quantitatively evaluate the Lightweight OpenPose performance, we considered the percentage of correctly classified key points. Specifically, we computed the overall percentage of valid key points over the input video, measuring 70% of correctly classified key points.

4.2. Fall Detection Results

Fig. 6 shows a sequence where a falling robot is correctly detected. A video showing the results of the fall-down detection system is available at https://youtu.be/a9rZD97vOT4.

Figure 6: Fall detection. First, Lightweight OpenPose successfully detects the robot's pose. Then, the fall action is captured by the combined use of ST-GCN and the aspect ratio of the bounding box.

Table 1: Quantitative results for the fall detection system.

  Metric         Value
  Accuracy       0.924
  Sensitivity    0.997
  Specificity    0.921
  mAP            0.955

Table 1 shows the experimental results. For evaluating the ST-GCN model, we performed several experiments using the UNIBAS NAO Pose Dataset. We calculated the number of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) detections to compute several metrics, namely Accuracy, Sensitivity, and Specificity.
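For completeness, the snippet below sketches how the Accuracy, Sensitivity, and Specificity values reported in Table 1 are derived from the TP, TN, FP, and FN counts. The function name is arbitrary, and the counts are placeholders obtained by comparing the ST-GCN outputs against the labeled test videos; mAP is not shown, since it additionally requires the ranked detection scores.

def fall_detection_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard classification metrics computed from raw detection counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall correctness
    sensitivity = tp / (tp + fn)                 # true positive rate (detected falls)
    specificity = tn / (tn + fp)                 # true negative rate (correct non-falls)
    return {"Accuracy": accuracy,
            "Sensitivity": sensitivity,
            "Specificity": specificity}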
4.3. Discussion

The overall performance of the system is good; however, some errors remain. The 70% accuracy in detecting the correct key points may be due either to the presence of occlusions, which prevent the robots from being correctly recognized, or to unusual robot poses that are not included in the dataset. In some cases, the robots are not facing the camera, but are instead viewed from the side. The camera and robot positions also affect the ST-GCN fall-down detection. This problem is related to the training dataset, whose annotations are taken from a point of view different from the one used during the RoboCup matches.

5. Conclusions

We have presented an approach for detecting the fall-down action for NAO robots in RoboCup SPL matches. In particular, the approach is based on two steps: 1) the estimation of the NAO pose by using Lightweight OpenPose and 2) the detection of the fall-down action by using a Spatial-Temporal Graph Convolutional Network (ST-GCN). For evaluating the pose estimation performance, we have conducted quantitative experiments using a novel dataset called UNIBAS NAO Pose Dataset. For training the ST-GCN, we used publicly available data from the Le2i dataset, while for testing we considered a set of labeled videos of RoboCup matches. The proposed method proves effective in detecting the fall-down action, producing only a few false positives due to the camera position and to the occlusions generated by the other robots or the referee. As future work, we intend to create a new dataset for NAO action recognition and to increase the number of actions recognized by the system.

References

[1] RoboCup Technical Committee, RoboCup SPL (NAO) Rule Book, 2022.
[2] H. Mellmann, B. Schlotter, P. Strobel, Data driven research and development in RoboCup: collection, organization and analysis of RoboCup game data, 2018.
[3] D. D. Bloisi, A. Pennisi, C. Zampino, F. Biancospino, F. Laus, G. Di Stefano, M. Brienza, R. Romano, MARIO: Modular and extensible architecture for computing visual statistics in RoboCup SPL, 2022. URL: https://arxiv.org/abs/2209.09987.
[4] Piero sports graphics, 2022. URL: https://www.rossvideo.com/products-services/acquisition-production/cg-graphics-systems/piero-sports-graphics-analysis/.
[5] Viz Libero, 2022. URL: https://www.vizrt.com/products/viz-libero.
[6] M. Stein, H. Janetzko, A. Lamprecht, T. Breitkreutz, P. Zimmermann, B. Goldlucke, T. Schreck, G. Andrienko, M. Grossniklaus, D. A. Keim, Bring it to the pitch: Combining video and movement data to enhance team sport analysis, 2017.
[7] SoccerNet dataset, 2022. URL: https://www.soccer-net.org/home.
[8] A. Deliège, A. Cioppa, S. Giancola, M. J. Seikavandi, J. V. Dueholm, K. Nasrollahi, B. Ghanem, T. B. Moeslund, M. V. Droogenbroeck, SoccerNet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos, 2020.
[9] Y. Jiang, K. Cui, L. Chen, C. Wang, C. Xu, SoccerDB: A large-scale database for comprehensive video understanding, 2020.
[10] D. D. Bloisi, A. Pennisi, L. Iocchi, Parallel multi-modal background modeling, Pattern Recognition Letters 96 (2017) 45–54.
[11] D. Albani, A. Youssef, V. Suriani, D. Nardi, D. D. Bloisi, A deep learning approach for object recognition with NAO soccer robots, in: RoboCup 2016: Robot World Cup XX, Springer International Publishing, 2017, pp. 392–403.
[12] UNIBAS WOLVES, UNIBAS NAO Pose Dataset, 2022. URL: https://drive.google.com/drive/folders/1wY9Xsz3O_gYc4BbGb4p_gALotynjcH-E?usp=sharing.
[13] J. Brooks, COCO Annotator, 2019. URL: https://github.com/jsbroks/coco-annotator.
[14] D. Osokin, Real-time 2D multi-person pose estimation on CPU: Lightweight OpenPose, 2018.
[15] I. Charfi, J. Miteran, J. Dubois, M. Atri, R.
Tourki, Definition and performance evaluation of a robust SVM-based fall detection solution, in: 8th Int. Conf. on Signal Image Technology and Internet Based Systems, 2012, pp. 218–224.