Few-shot Keypose Detection for Learning of Psychomotor Skills

Benjamin Paaßen¹, Tobias Baumgartner², Mai Geisen², Nina Riedl² and Miloš Kravčík¹

Abstract
Some psychomotor tasks require students to perform a specific sequence of poses and motions. A natural teaching scheme for such tasks is to contrast a student's execution with a teacher demonstration. However, this requires strategies to match the teacher demonstration of each motion to the student's attempts and to identify differences between demonstration and attempt. In this paper, we investigate methods to automatically detect student attempts at poses with only a single correct teacher demonstration. We investigate relevance learning, prototype networks, and attention mechanisms to achieve a robust few-shot approach which generalizes across students. In an experiment with one teacher and 27 students performing a sequence of motion elements from the field of fitness and dance, we show that prototype networks combined with an attention mechanism perform best.

Keywords
psychomotor training, few-shot learning, metric learning, prototype networks, convolutional neural networks

1. Introduction

Some psychomotor skills require us to execute a specific sequence of poses, such as in dance choreographies or during the repetitive execution of fitness moves, e.g. squats. While research has emphasized the need for holistic teaching beyond mere imitation [1], at least part of beginners' training is concerned with executing the basic set of poses correctly [2]. Our aim is to automate some of this basic teaching by contrasting a student's current motion with a teacher demonstration and highlighting differences [3]. However, to correctly compare student and teacher motion, we need to establish a matching between both sequences of poses. Typically, matchings between sequences are computed via alignment distances such as dynamic time warping [4, 2]. However, such techniques are ill-suited to our scenario, where correct execution only depends on poses, not on the transition motion between poses. Our goal is to recognize the few points in time where the student attempted a certain pose and to contrast the student's execution with the teacher's demonstration of the same pose.

An additional challenge is posed by the scarcity of training data. For any specific pose, we expect only a single (correct) demonstration by a teacher and only few annotated student sequences. In other words, we are confronted with a few-shot learning setting, where we need to share as much information across poses as possible to learn a viable model. In particular, we draw upon prior work on prototype networks [5] and metric learning [6].
Prototype networks would map each frame of student motion $x_t$ at time $t$ and each teacher demonstration $w_k$ for pose $k$ to auxiliary representations $f(x_t)$ and $f(w_k)$, and detect pose $k$ at time $t$ if the distance between $f(x_t)$ and $f(w_k)$ is low. By contrast, metric learning is concerned with finding a distance function $d$ such that the distance $d(x, y)$ is low if $x$ and $y$ should be close and large if $x$ and $y$ should not be close [6]. In our setting, we wish to learn a distance such that $d(x_t, w_k)$ is small if the student attempted pose $k$ in frame $t$ and large otherwise. A special kind of metric learning is to learn weights $\alpha_l$ for each dimension $l$. If these weights are non-negative, such a weighting becomes equivalent to an attention mechanism, where we interpret a large $\alpha_l$ as paying attention to dimension $l$, whereas a small $\alpha_l$ means that dimension $l$ is unimportant for the current decision [7]. Our contribution in this work is to combine prototypical networks with an attention mechanism for keypose recognition, where, to the best of our knowledge, such techniques have not yet been applied.

2. Method

Our task is to recognize times $t$ when a student tried to execute a certain pose $k$. More specifically, we record the motion of a student with a Kinect camera, yielding a time series $x_1, x_2, \ldots, x_T$, where each $x_t$ is a $26 \times 3$ matrix storing the 3D positions of 26 joints of the human body at frame $t$. We wish to compare the student's frames to expert demonstrations $w_1, \ldots, w_K$ for each of the $K$ poses and to recognize frames where the student attempted a pose. Our basic approach is to compute some distance $d(x_t, w_k)$ between student frames and teacher demonstrations and to recognize pose $k$ at time $t$ whenever $d(x_t, w_k) < \theta$ for some threshold $\theta$. However, the Euclidean distance on the raw 3D positions is not suitable because it would be disturbed by differences in body size and orientation, as well as by deviations in joints that are irrelevant for the specific pose (Fig. 1, a). To account for differences in body size and orientation, we first translate the time series to angular space, in particular the azimuth and elevation of each joint relative to its parent joint in the human skeleton (Fig. 1, b). Further, we apply learnable weights $\alpha_{k,l}$ for each pose $k$ and each joint $l$, resulting in the distance $d(x_t, w_k)^2 = \sum_{l=1}^{26} \alpha_{k,l} \cdot \|x_{t,l} - w_{k,l}\|^2$ (Fig. 1, c).

To learn the weights $\alpha_{k,l}$, we require training data where the correct matching between student and teacher is known. In particular, assume that we have one time series $x^i_1, \ldots, x^i_{T_i}$ per student $i$. For each of these time series, assume that experts provided a label sequence $y^i_{1,k}, \ldots, y^i_{T_i,k}$ for each pose $k$, where $y^i_{t,k} = 1$ if pose $k$ should be recognized at time $t$, $y^i_{t,k} = 0$ if pose $k$ should not be recognized at time $t$, and $y^i_{t,k} = -1$ if we consider it irrelevant whether pose $k$ is recognized at time $t$ or not.

[Figure 1: An illustration of different ways of computing distances between student poses $x_t$ (blue) and teacher poses $w_k$ (orange): a) Euclidean distance, b) angular distance, c) weighted angular distance (relevance learning / attention net), d) embedding distance (prototype net, prototype attention net).]

[Figure 2: An illustration of the contrastive loss (1), plotting $d(x_t, w_k)^2$ over $t$. If $y_{t,k} = 1$, we punish any distance larger than zero (orange lines at 11, 12, 13, 14). Otherwise, we punish distances smaller than 1 (orange lines at 2, 3, and 15).]
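To make the weighted angular distance concrete, the following minimal NumPy sketch shows how a frame could be converted to angles and compared against a demonstration. This is our own illustration, not the paper's implementation: it assumes a hypothetical `parents` array encoding each joint's parent in the skeleton (with -1 for the root) and ignores angle wrap-around for brevity.

```python
import numpy as np

def to_angles(frame, parents):
    """Convert a 26 x 3 matrix of 3D joint positions into azimuth and
    elevation angles of each joint relative to its parent joint (26 x 2)."""
    angles = np.zeros((frame.shape[0], 2))
    for l, p in enumerate(parents):
        if p < 0:  # the root joint has no parent; leave its angles at zero
            continue
        dx, dy, dz = frame[l] - frame[p]
        angles[l, 0] = np.arctan2(dy, dx)                # azimuth
        angles[l, 1] = np.arctan2(dz, np.hypot(dx, dy))  # elevation
    return angles

def weighted_distance_sq(x_t, w_k, alpha_k, parents):
    """Squared weighted angular distance between a student frame x_t and a
    teacher demonstration w_k:
    d(x_t, w_k)^2 = sum_l alpha_{k,l} * ||x_{t,l} - w_{k,l}||^2."""
    diff = to_angles(x_t, parents) - to_angles(w_k, parents)
    return float(np.sum(alpha_k * np.sum(diff ** 2, axis=1)))
```

In relevance learning, described next, the per-pose weights $\alpha_{k,l}$ are exactly the trainable parameters of this distance.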
Given this kind of training data, we learn the weights $\alpha_{k,l}$ by minimizing the contrastive loss

$$\ell = \sum_i \sum_{k=1}^{K} \left( \sum_{t : y^i_{t,k} = 1} d(x^i_t, w_k)^2 + \sum_{t : y^i_{t,k} = 0} \left[ 1 - d(x^i_t, w_k)^2 \right]_+ \right), \qquad (1)$$

where $[1 - d]_+ = \max\{0, 1 - d\}$. This loss punishes large distances $d(x^i_t, w_k)$ if $y^i_{t,k} = 1$ and distances below 1 if $y^i_{t,k} = 0$ (Fig. 2). In other words, the loss tries to ensure that we can recognize pose $k$ correctly by checking whether $d(x_t, w_k)$ is smaller than $\theta = 1$. We can optimize this loss with standard gradient-based nonlinear optimization techniques, such as L-BFGS. We call this approach relevance learning, in line with [8]. Note that this scheme is, essentially, a simple metric learning scheme [6, 8]. It is also nicely interpretable because we can inspect the learned weights $\alpha_{k,l}$ and check whether they make sense to a domain expert. Further, we can provide feedback to students by highlighting joints $l$ where the weighted deviation $\alpha_{k,l} \cdot \|x_{t,l} - w_{k,l}\|^2$ is large. For example, imagine a virtual mirror with an avatar of the student where the avatar's joint $l$ is color-coded as red, similar to the scheme of [2].

Unfortunately, relevance learning is limited to situations where the set of poses is fully known. For every new pose $k$, we would need to record new training data to train new weights $\alpha_{k,l}$, which may be infeasible. Instead, we would prefer an approach which can be applied to new poses without any re-training. To that end, we apply a two-layer feedforward neural network $g$ which receives the expert demonstration $w_k$ of a pose as input and maps it to weights $\alpha_{k,l} = g_l(w_k)$. By applying a sigmoid nonlinearity, we ensure that the weights remain in the range $[0, 1]$. This is, essentially, an attention mechanism, a concept which has become popular for sequence processing [7]. We call this approach attention net and train it with the same contrastive loss (1) as before.

In addition to the attention mechanism, we also test the effect of a refined motion representation. In particular, we use a 1D convolution with kernel length 31 and 32 filters, followed by a sigmoid nonlinearity and a linear layer which reduces the 32 filter dimensions to a single number¹. Finally, we apply another sigmoid nonlinearity and a linear layer from the 26 joints to $n$ latent dimensions, which integrates information across joints. Overall, we obtain an $n$-dimensional representation $f(x_t)$ of each motion frame $x_t$. We compare it to the correct demonstration $f(w_k)$ via the distance $d(x_t, w_k)^2 = \sum_{l=1}^{n} g_l(w_k) \cdot \big( f_l(x_t) - f_l(w_k) \big)^2$ (Fig. 1, d). We learn all neural network parameters by minimizing loss (1), but we add a regularization term $\lambda \cdot \sum_i \sum_{t=2}^{T_i} \|f(x^i_t) - f(x^i_{t-1})\|^2$ to ensure that the learned motion representation is smooth over time. We call this approach prototype attention net because it integrates the concept of the attention net with the representation approach of prototypical networks [5]. As a final model, we also consider a prototype net where we omit the attention net and set the weights $g_l(w_k)$ to 1 instead.

¹ 32 was chosen as the next power of 2 above the number of keyposes, which was 25. However, future work could investigate more hyperparameter combinations.
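As a minimal sketch of the training objective, loss (1) for a single student and pose could look as follows in PyTorch (the tensor layout and variable names are our own assumptions):

```python
import torch

def contrastive_loss(dist_sq, labels):
    """Contrastive loss (1) for one student i and one pose k.
    dist_sq: squared distances d(x_t, w_k)^2 for all frames t, shape (T,).
    labels:  tensor with entries in {1, 0, -1}; frames labeled -1 are ignored.
    Positive frames (label 1) are pulled towards distance 0; negative frames
    (label 0) are pushed beyond distance 1 via the hinge [1 - d^2]_+."""
    pos = dist_sq[labels == 1].sum()
    neg = torch.clamp(1.0 - dist_sq[labels == 0], min=0.0).sum()
    return pos + neg
```

Summing this quantity over all students and poses yields $\ell$ from (1); the result can then be minimized with a gradient-based optimizer such as torch.optim.LBFGS.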
3. Experiments

We compare relevance learning, attention net, prototype net, and prototype attention net on a dataset of 27 students and one teacher, all executing a series of 25 fitness and dance motion elements (namely squat, raise arms (2x), tree squat, "pirate" (6x), elbow to knee (3x), crossover (2x), airplane, airplane squat, standing, lunge (4x), T-pose, clock (2x)) while being recorded with a Kinect camera. We had 15 female and 12 male participants with a mean age of 27 years (standard deviation 4.49 years). Table 1 displays the self-reported prior experience of the participants with sports in general and video tutorials in particular.

Table 1: The number of participants with a specific level of self-reported prior experience with sports (top row) and video tutorials (bottom row).

           Expert   Advanced   Some   None
Sports        6        5        11      5
Video         7       13         6      1

We note that the annotators only annotated the first frame where a student attempted a certain pose. To arrive at complete labels $y^i_{t,k}$, we used a heuristic scheme: we automatically labeled the 30 frames after the actual annotation with $y^i_{t,k} = 1$ and extended the annotation further as long as the 3D marker positions did not change by more than 5 times the average distance between adjacent frames. We further set $y^i_{t,k} = -1$ for the 30 frames before and after the annotated region to ignore cases where the student was already or still close to the target pose, but not close enough to count as training data.

Even after this preprocessing, though, our dataset is highly imbalanced: $y^i_{t,k} = 1$ is relatively rare, whereas $y^i_{t,k} = 0$ is common. Accordingly, we do not report accuracy but recall, precision, and F1 score, as well as the area under the precision-recall curve (AUC).

Table 2: The average evaluation measures ± standard deviation across poses.

model                     recall        precision     F1            AUC
relevance learning        0.81 ± 0.13   0.18 ± 0.09   0.28 ± 0.12   0.53 ± 0.23
attention net             0.60 ± 0.21   0.38 ± 0.15   0.43 ± 0.17   0.49 ± 0.22
prototype net             0.81 ± 0.16   0.53 ± 0.14   0.61 ± 0.15   0.72 ± 0.18
prototype attention net   0.81 ± 0.14   0.56 ± 0.12   0.64 ± 0.14   0.74 ± 0.16

Table 2 shows the results. As we can see, the prototype attention net performs best according to all measures. A Wilcoxon signed-rank test revealed that the AUCs for relevance learning and the attention net were both significantly lower ($p < 10^{-3}$), whereas the AUCs of the prototype net and the prototype attention net were statistically indistinguishable. This finding indicates that representation learning is more crucial than joint weighting in achieving good keypose detection results, at least on this particular dataset.
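For reference, the per-pose evaluation could be sketched as follows. This is our own sketch, assuming NumPy and scikit-learn; the array names are hypothetical, and frames labeled $-1$ are excluded before scoring, as in the experiments.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def evaluate_pose(dist_sq, labels, theta=1.0):
    """Evaluate keypose detection for a single pose k: detect the pose
    whenever d(x_t, w_k) < theta, and compute recall, precision, F1,
    and the area under the precision-recall curve (AUC)."""
    mask = labels >= 0                  # drop frames labeled irrelevant (-1)
    y_true = labels[mask]
    d_sq = dist_sq[mask]
    y_pred = (d_sq < theta ** 2).astype(int)  # d < theta  iff  d^2 < theta^2

    tp = np.sum((y_pred == 1) & (y_true == 1))
    recall = tp / max(np.sum(y_true == 1), 1)
    precision = tp / max(np.sum(y_pred == 1), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)

    # negated distances serve as detection scores for the PR curve
    prec, rec, _ = precision_recall_curve(y_true, -d_sq)
    return recall, precision, f1, auc(rec, prec)
```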
4. Conclusion

We considered the problem of keypose detection for a sequence of fitness and dance motion elements in a few-shot setting, where only a single teacher demonstration and few student demonstrations per pose exist. To detect keyposes, we evaluated methods which compute a distance between teacher demonstrations and student frames and detect a keypose if the distance is below 1. We compared several schemes to arrive at such a distance, namely (1) relevance learning, which optimizes joint weights for each keypose, (2) an attention neural net, which infers the joint weights from the respective teacher demonstration, (3) a prototype network, which represents both teacher and student motion in a latent space before computing the distance, and (4) a combination of prototype and attention network. As expected, the prototype attention net (4) performed best, but we found that the prototype net (3) performed nearly as well. Therefore, we conclude that representation learning is more crucial than attention, at least for our example.

In future work, it should be investigated how well a prototype network generalizes to new keyposes it was not trained on, and how far performance can be improved with refined architectures. Beyond keypose detection, future work should investigate the ability to recognize entire motions, in addition to purely static poses. For all these future research opportunities, we believe that our proposed loss function and training scheme can contribute to robust detection approaches, which in turn can become a crucial component for new feedback methods in psychomotor learning.

Acknowledgments

Funding by the German Federal Ministry of Education and Research (BMBF) for the project MILKI-PSY (grant no. 16DHB4014) is gratefully acknowledged.

References

[1] S. Anu, V. Ele, Teaching dance in the 21st century: A literature review, The European Journal of Social & Behavioural Sciences 7 (2013) 624–640. doi:10.15405/ejsbs.2013.7.issue-4.
[2] F. Hülsmann, C. Frank, I. Senna, M. O. Ernst, T. Schack, M. Botsch, Superimposed skilled performance in a virtual mirror improves motor performance and cognitive representation of a full body motor action, Frontiers in Robotics and AI 6 (2019) 43. doi:10.3389/frobt.2019.00043.
[3] B. Paaßen, M. Kravčík, Teaching psychomotor skills using machine learning for error detection, in: R. Klemke, K. Asyraaf Mat Sanusi, et al. (Eds.), Proceedings of the 1st International Workshop on Multimodal Immersive Learning Systems (MILeS 2021), 2021, pp. 8–14. URL: http://ceur-ws.org/Vol-2979/paper1.pdf.
[4] T. K. Vintsyuk, Speech discrimination by dynamic programming, Cybernetics 4 (1968) 52–57. doi:10.1007/BF01074755.
[5] J. Snell, K. Swersky, R. Zemel, Prototypical networks for few-shot learning, in: I. Guyon, U. V. Luxburg, S. Bengio, et al. (Eds.), Proc. NeurIPS, 2017, pp. 4077–4087.
[6] A. Bellet, A. Habrard, M. Sebban, A survey on metric learning for feature vectors and structured data, arXiv:1306.6709 (2014).
[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, et al. (Eds.), Proc. NeurIPS, 2017, pp. 5998–6008.
[8] B. Hammer, T. Villmann, Generalized relevance learning vector quantization, Neural Networks 15 (2002) 1059–1068. doi:10.1016/S0893-6080(02)00079-5.