Video Analytics for Volleyball: Preliminary Results and Future Prospects of the 5VREAL Project

Andrea Rosani, Ivan Donadello, Michele Calvanese*, Alessandro Torcinovich, Giuseppe Di Fatta, Marco Montali and Oswald Lanz

Libera Università di Bolzano, Piazza Università 1, Bozen-Bolzano, 39100, Italy

Abstract
This paper introduces a real-time action recognition and tactical-behavior mining system designed specifically for volleyball games. The system aims to support data augmentation, video annotation and KPI extraction by accurately identifying the actions, and the sequential patterns of actions, performed during volleyball matches. Leveraging advanced computer vision techniques, the system aims to automatically detect and recognize player actions and group actions in real time. Process Mining techniques are then used to extract tactical behaviors, in the form of temporal relations among player actions. By providing precise annotations, the system offers an instrument for volleyball game analytics and tactical analysis. This paper outlines the architecture and key components of the system and presents some preliminary results on the performance of the proposed model.

Keywords
Video action recognition, data augmentation, video annotation, process mining, sports

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* M. Calvanese contributed with work done during his Master Thesis project at UPC Barcelona with Prof. Carlos Andujar Gran.
ORCID: 0009-0008-2622-6776 (A. Rosani); 000-0002-0701-5729 (I. Donadello); 0009-0005-4103-0147 (M. Calvanese); 0000-0001-8110-1791 (A. Torcinovich); 0000-0003-3096-2844 (G. Di Fatta); 0000-0002-8021-3430 (M. Montali); 0000-0003-4793-4276 (O. Lanz)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Over the past decade, action recognition in professional sport activities has rapidly gained popularity as a tool for a variety of tasks such as player performance analytics, computer-aided game refereeing, and the like. In response to this interest, several action recognition systems have been devised for sports such as football, basketball, and rugby.

In this context, this paper presents an action recognition system for volleyball game analysis. The preliminary results obtained during the activity focus on the detection of actions, events, and tactical behaviors in volleyball, with the final objective of providing a reliable AI-powered data augmentation system that can be used for the TV broadcasting of volleyball games in a real-time scenario, as well as for offline analytics activities, starting from the video collected by a multi-view source and shared using 5G transmission.

The document is structured into several sections that outline in detail the study process and the developments obtained. First, a review of some particularly relevant works in the specific field is proposed. Then, methods and algorithms are described, along with some results of preliminary experiments on a public dataset [7].

1.1. Context: the 5VREAL Project

This paper describes the preliminary results obtained during the activity related to the project 5VREAL – 5G Volley Reality Experience & Analytics Live, focused on the study and implementation of a system for the acquisition, analysis and transmission of video and analytics in the context of volleyball games and training sessions. The project aims to create a scalable solution, which can be used at all levels of competition, professional and amateur.

Two use cases are developed:
• Fun Engagement: This use case aims to use artificial intelligence algorithms to enrich the spectator's experience while watching the match, with augmented reality information displayed in real time on the broadcast videos.
• Coach: Use of the game & 'rhythm' for technical staff. After the game, the technical staff or the coach directly receives indications on positions, speed, trajectories, time intervals between touches, and higher-level semantic information about the tactical behaviors of the team, which can favor a more in-depth technical and tactical analysis.

The involvement in the project of industrial partners operating in the media production sector will enable a real application scenario in which to test the performance of the proposed solution. The project is funded by the Italian Ministry of Enterprises and Made in Italy (MIMIT) under the MIMIT FSC 2014-2020: Tecnologie 5G. Progetti di sperimentazione e ricerca – Piano di Sviluppo e Coesione 2014-2020.

2. State of art in action recognition and tactical behavior for volley

The task of action/pose estimation involves analyzing video content to track one or more persons of interest and identify their key anatomical features, typically defined as keypoints [14], [26]. When multiple actors interact, the task is usually referred to as Group Activity Recognition (GAR) [18], [19], [22].

GAR algorithms differ in how they model spatial and temporal information in videos. Some dated approaches apply recurrent models: [7] develops a hierarchical model based on two long short-term memory (LSTM) models, [13] proposes a recurrent neural network (RNN) model with attention mechanisms and semantic graphs, [3] generates a map of candidate regions of interest and uses an RNN architecture for temporal processing, and [24] adopts a top-down approach using Gated Recurrent Units.

Other works focus on convolutional mechanisms: [2] develops a convolutional relational machine for GAR, and [19] works on individual poses using one-dimensional convolutional neural networks.

Newer models like graph-based networks and Transformers are also employed: [25] uses a graph-based model for spatio-temporal relationships, [10] designs a descriptor for crowded scenarios, and [12] proposes a Transformer-based solution for processing spatial and temporal information.

To recognize tactical behaviors, techniques like sequence mining algorithms and Inductive Logic Programming are used ([21], [19], [23]). Works in this field include [9] and [11], which predict complex events from football matches using Answer Set Programming and Subgraph Discovery. In our work, temporal pattern mining algorithms based on Linear Temporal Logics will be used, offering a different approach compared to the mentioned works.

3. Methodology and algorithms

3.1. General architecture of the system

The AI block consists of a set of algorithms required to a) identify the position and trajectory of the ball, b) identify the position of individual players, and c) detect and identify actions performed within a specific timeframe.

Images for the AI block are acquired by three iPhone 14 Pro devices with calibrated cameras, mounted on tripods and connected to a backend via 5G, producing synchronized, compressed SRT (Secure Reliable Transport) video streams.

Figure 1: Overview of the architecture of the volleyball action recognition system.

The ball localization module starts the processing by producing a continuous data stream of the ball trajectory. When a change in its direction is detected, the player tracking and action detection modules are activated (Figure 1). This generates an output of the events that occurred in the selected timeframe. In the following, we analyze the different steps in detail. 3D ball tracking is described by a project partner in another submission to Ital-IA 2024.

3.2. Ball trajectories change detection

The general scheme for ball trajectory analysis can be subdivided into the following steps (Figure 2):
1. Identification of possible candidate ball positions.
2. Incremental interpolation of candidates with parabolic trajectories, producing a parabola for each frame.
3. Linking of trajectories from which to derive the motion of the ball.
4. Detection of trigger events when the ball undergoes an upward acceleration, such as a player touching the ball or a bounce on the floor.

The algorithm, originally proposed in [5], requires as input the positions of the ball at each time step, which can be obtained with a ball tracking system [14]. The path of the ball is modelled by a piecewise parabolic trajectory. Initially, seed triplets are identified within a threshold distance (r). These triplets serve as initial anchors for parabolic fitting. Due to false positives, multiple seed triplets per frame may exist. Each triplet is used to fit a parabola, and candidate detections close to the estimated position are added to a set of supporting points.

Figure 2: Ball trajectories analysis and trigger event detection [5].

The temporally furthest points within the support set are used to fit a new parabola. This iterative process continues until the set of supporting points ceases to grow. Parabolas with upward-pointing acceleration vectors are excluded, as they violate physical constraints.

To ensure a unique parabola per frame, trajectory distances are computed and used to construct a weighted graph. Dijkstra's algorithm [6] identifies the optimal path through this graph, yielding the final sequence of parabolas describing the ball's path.

Considering that the action mainly occurs around the ball's position, the proposed solution allows for detecting changes in the direction of the ball due to gameplay interactions. This trajectory variation triggers an analysis of the activities performed near the contact point, activating the subsequent phase of recognizing the actions of individual players and teams (Figure 3).

Figure 3: Action and Group Activity Recognition (images from [7]). The variation in the ball trajectory identifies an interaction that triggers the event.

3.3. Individual player action recognition

In the rapidly evolving field of action recognition, many datasets, structures, and architectures have been introduced to address the challenges and complexities associated with understanding human actions in different environments [4]. These studies focus on extracting meaningful information from videos by detecting and recognizing what a subject is doing [15], [16], [17].

Posture detection occurs within the video stream, in the player's bounding box, that is, the area of interest of an object (the player, in this case) tracked in each video frame. It uses pose estimation technologies based on machine learning models [24], which identify key anatomical features of players, such as joints, extremities, and center of mass, commonly referred to as keypoints [8]. In the case of a volleyball player, the bounding box is used to locate the player's position within the video frame and subsequently extract keypoints on the player's body (Figures 4 and 5).

Figure 4: Example annotation from the Volleyball dataset showing the bounding box of each player divided by team (using different colors) and the action performed ("Left spike"). (Image from [7])

Starting from this information, it is possible to perform action recognition, as demonstrated effectively in [16], [17], which will be used as the reference in the project for this specific task.

3.4. Team activity recognition

The challenge of Group Activity Recognition (GAR) requires addressing two main aspects. First, it demands a compositional understanding of the scene. Due to the relatively high number of people present in the scene, it is challenging to learn meaningful representations for GAR over the entire area. Since group activities often involve subgroups of actors and scene objects, the final label of the action depends on a compositional understanding of these entities. Second, GAR benefits from relational reasoning on scene elements to understand the relative importance of entities and their interactions [26].

4. Preliminary results

In the following, we present some preliminary results obtained using state-of-the-art techniques on publicly available datasets.

4.1. Dataset

The Volleyball dataset [7] represents a significant resource in the context of sports action recognition, specifically for volleyball. Although originally designed for athlete action recognition, the dataset has been extended to include the task of 2D ball detection in the image. The dataset comprises a total of 4830 frames from 55 videos, offering a wide variety of actions and activities to analyze (Figure 4). In the dataset, there are nine annotations for individual player actions and eight group activities, detailed in Table 1.

Table 1: Individual player action classes and group activity classes, with the number of instances of each.

Action class   No. of instances   Group activity class   No. of instances
Waiting        3601               Right set              644
Setting        1332               Right spike            623
Digging        2333               Right pass             801
Falling        1241               Right winpoint         295
Spiking        1216               Left winpoint          367
Blocking       2458               Left pass              826
Jumping        341                Left spike             642
Moving         5121               Left set               633
Standing       38696

4.2. Group activity recognition

GAR is performed at different levels. Initially, the keypoints of the various players are extracted.

Figure 6: Example 2D application of player identification and identification of ball trajectory changes ("trigger"). Keypoints can be observed on each player's silhouette, along with the corresponding arc of the ball trajectory.
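The trigger shown in Figure 6 comes from the trajectory analysis of Section 3.2. As a minimal sketch of that idea, and not the project's implementation, the following Python fragment fits a parabola through each triplet of consecutive ball heights and flags a trigger wherever the fitted vertical acceleration points upward. It assumes a single vertical coordinate measured upward in metres, uniform sampling, and no false detections; the names fit_parabola and detect_triggers, and the synthetic trajectory, are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_parabola(ts: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Return (a, b, c) such that y = a*t**2 + b*t + c passes through three samples."""
    (t0, t1, t2), (y0, y1, y2) = ts, ys
    d01 = (y1 - y0) / (t1 - t0)  # first divided differences
    d12 = (y2 - y1) / (t2 - t1)
    a = (d12 - d01) / (t2 - t0)  # second divided difference = half the acceleration
    b = d01 - a * (t0 + t1)
    c = y0 - a * t0 * t0 - b * t0
    return a, b, c


def detect_triggers(ts: Sequence[float], ys: Sequence[float],
                    min_upward_acc: float = 0.0) -> List[int]:
    """Indices whose surrounding triplet has an upward-pointing acceleration."""
    triggers = []
    for i in range(1, len(ts) - 1):
        a, _, _ = fit_parabola(ts[i - 1:i + 2], ys[i - 1:i + 2])
        if 2.0 * a > min_upward_acc:  # free flight gives about -9.8 m/s^2
            triggers.append(i)
    return triggers


def synthetic_height(t: float) -> float:
    """Hypothetical ball height: free flight, then a touch at t = 1.0 s."""
    if t < 1.0:
        return 3.0 + 2.0 * t - 4.9 * t * t
    s = t - 1.0
    return 0.1 + 6.0 * s - 4.9 * s * s  # velocity reversed upward by the touch


times = [0.1 * i for i in range(20)]
heights = [synthetic_height(t) for t in times]
print(detect_triggers(times, heights))
```

On this synthetic data the only flagged sample is index 10 (t = 1.0 s), where the simulated touch reverses the vertical velocity. The method in [5] additionally grows support sets, discards unphysical parabolas, and links segments with Dijkstra's algorithm; none of that is reproduced here.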
Based on these keypoints, the action each player is performing is estimated and then related to the predicted person-to-person and person-to-group interactions.

4.2.1. Trigger event identification and GAR

The situation that activates the GAR mechanism is represented by the trigger, identified with the change of the ball direction (Figure 5).

Figure 5: Detailed schema for action and group activity recognition.

In Figure 6 we present some frames from [7], processed using the proposed algorithms, detailed in the following section, allowing for a comprehensive visualization of the keypoints of the various players combined with the trajectories of the ball.

4.2.2. Hierarchy of semantic events for GAR

Taking inspiration from the approach proposed in [26], composite learning of entities in the video and relational reasoning on these entities are established. As humans do, the method represents objects at various granularities and reasons about their interactions to transform sensory signals into high-level knowledge. GAR is addressed by modeling a video as a set of tokens representing multi-scale semantic concepts present in the video, which makes the described method easily adaptable to any video with multi-actor, multi-object interactions.

In the specific case of volleyball, the actors are represented by the players, while the object is represented by the ball. These tokens include keypoints, people, person-to-person interactions, person-to-group interactions, and object interactions. The performance of this analysis, compared to previous techniques based on standard RGB analysis (i.e., considering the entire images and not just the keypoints), shows significant accuracy (Figure 7).

Figure 7: Our results on the Volleyball dataset considering the Olympic Split [7], [26]. The first confusion matrix represents GAR, the second one the single player activities.

4.3. Tactical behavior

By tactical behavior, we mean a set of temporal relationships among volleyball actions that can lead to an outcome of particular interest, such as scoring a point. In what follows we provide a conceptual framework to formally define tactical behaviors and use Process Mining (PM) techniques for mining tactical behaviors from annotated volleyball matches.

4.3.1. A conceptual model for tactical behaviors

A tactical behavior is a set of temporal relationships over events in a volleyball match. An event is the main action of a player on the ball, which has a start time, an end time, a set of players involved (with information related to their pose, their bounding boxes, and their unique identifiers), the quality of the action, and the position of the ball. For example:
• A dunk by a player from area A1 is immediately followed by a point scored.
• A reception (with low quality) of a player is immediately followed by a point.

Our conceptual model for a volleyball event is shown in Figure 8.

Figure 8: The conceptual model for volleyball events.

A volleyball match is therefore a sequence of annotations of volleyball events in chronological order. Such events are annotated with the computer vision techniques above or provided by scoutmen.

4.3.2. Process Mining for tactical behaviors

Process Mining [20] embraces Data Mining and Knowledge Representation and focuses on the analysis and improvement of business processes based on data collected from information systems. One of its key features is the availability of tools for mining information from temporal discrete data. We analyzed the matches of the Volleyball dataset (converted into a suitable format) with the Process Mining tool RuM (Rule Mining Made Simple) [1] to mine tactical behaviors.

RuM extracts temporal relations among actions of volleyball events through a list of templates defined with Linear Temporal Logic over finite traces (LTLf), one of the reference logics in the field [28]. Examples of such templates are the Chain Response between actions A and B, which means that action A must be immediately followed by action B, and the Alternate Precedence between A and B, which means that action B must be preceded by action A without any other occurrence of B in between (see [27], Table 2). In addition, RuM allows the selection of a numeric support that indicates the percentage of occurrence of a particular template in the set of matches and that can be used as a key process indicator. The 55 Volleyball matches were analyzed in less than 10 seconds, a suitable performance for an offline scenario. With a support of 20%, we obtained 50 tactical behaviors expressed using LTLf templates, automatically translated by the tool into natural language sentences for better human comprehension. An example of a mined tactical behavior is that in 47.73% of the matches, each jump (for a block) is preceded by a dunk without any other jump in between. In addition, RuM also allows us to link the tactical behaviors of actions to the other concepts of the above conceptual scheme.

RuM also supports the manual definition of tactical behaviors and the analysis of the matches according to such predefined behaviors. This task is called conformance checking and, as two examples of tactical behaviors, we defined that a jump is followed by a spike and that a spike is followed by a block. Figure 9 shows the results of the RuM conformance checking.

Figure 9: The conformance checking analysis of predefined tactical behaviors.

Each behavior is analyzed for each match and, on the right, the actions of match 5 are shown and highlighted in green if they conform to the tactical behavior, in red otherwise.

Acknowledgements

This work is supported by 5VREAL – 5G VOLLEY REALITY EXPERIENCE & ANALYTICS LIVE, CUP I53C23001340005, funded by the Italian Ministry of Enterprises and Made in Italy.

References

[1] Alman, A., Donadello, I., Maggi, F. M., Montali, M. Declarative Process Mining for Software Processes: The RuM Toolkit and the Declare4Py Python Library. In Int. Conf. on Product-Focused Sw Process Improvement (2023).
[2] Azar S.M., Atigh M.G., Nickabadi A., Alahi A.: Convolutional relational machine for group activity recognition. In: IEEE CVPR. (2019)
[3] Bagautdinov, T., Alahi, A., Fleuret, F., Fua, P., Savarese, S.: Social scene understanding: End-to-end multi-person action localization and collective activity recognition. IEEE CVPR. (2017)
[4] Camarena F, Gonzalez-Mendoza M, Chang L, Cuevas-Ascencio R: An Overview of the Vision-Based Human Action Recognition Field, Math. Comput. Appl. 2023
[5] Calvanese M: Ball tracking in Padel Videos using Convolutional Neural Networks. [Laurea magistrale], Università di Bologna, Corso di Studio in Artificial intelligence, 2023
[6] Dijkstra E.W: A note on two problems in connexion with graphs. Numerische Mathematik, 1959
[7] Ibrahim MS, Muralidharan S, Deng Z, Vahdat A, Mori G. A hierarchical deep temporal model for group activity recognition. CVPR, 2016
[8] Jiang T, Lu P, Zhang L, Ma N, Han R, Lyu C, Li Y, Chen K: RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose. ArXiv, 2023
[9] Khan, A., Bozzato, L., Serafini, L., Lazzerini, B. (2019). Visual reasoning on complex events in soccer videos using answer set programming. In GCAI 2019.
[10] Li J, Wang C, Zhu H, Mao Y, Fang H, Lu C.: CrowdPose: Efficient Crowded Scenes Pose Estimation and A New Benchmark, CVPR, 2019
[11] Meerhoff, L. A., Goes, F. R., De Leeuw, A. W., Knobbe, A. (2020). Exploring successful team tactics in soccer tracking data. In Machine Learning and Knowledge Discovery in Databases: Int. Workshops of ECML PKDD 2019.
[12] Nabi, M., Bue, A., Murino, V.: Temporal poselets for collective activity detection and recognition. In: IEEE CVPR. pp. 500–507 (2013)
[13] Qi, M., Qin, J., Li, A., Wang, Y., Luo, J., Van Gool, L.: stagnet: An attentive semantic rnn for group activity recognition. In: Proc. of the ECCV. (2018)
[14] Rahimian P, Toka L: Optical tracking in team sports: A survey on player and ball tracking methods in soccer and other team sports. Journal of Quantitative Analysis in Sports, 2022
[15] Sudhakaran S, Escalera S, Lanz O: Gate-Shift Networks for Video Action Recognition, IEEE CVPR 2020
[16] Sudhakaran S, Escalera S, Lanz O: Gate-Shift-Fuse for Video Action Recognition, IEEE TPAMI, 2023
[17] Takahashi M, Ikeya K, Kano M, Ookubo H, Mishina T: Robust Volleyball Tracking System Using Multi-View Cameras. ICPR, 2016
[18] Thilakarathne H., Nibali A., He Z., Morgan S.: Pose is all you need: The pose only group activity recognition system (pogars). arXiv preprint arXiv:2108.04186 (2021)
[19] Van Haaren, J., Ben Shitrit, H., Davis, J., Fua, P. (2016, August). Analyzing volleyball match data from the 2014 world championships using machine learning techniques. In Proceedings of the 22nd ACM SIGKDD (pp. 627-634).
[20] Van Der Aalst, W. (2016). Data science in action (pp. 3-23). Springer Berlin Heidelberg.
[21] Wenninger, S., Link, D., Lames, M. (2019). Data mining in elite beach volleyball–detecting tactical patterns using market basket analysis. IJCSS, 18(2), 1-19.
[22] Wu L.F., Wang Q., Jian M., Qiao Y., Zhao, B.X.: A comprehensive review of group activity recognition in videos. International Journal of Automation and Computing pp. 1–17 (2021)
[23] Xia, H., Tracy, R., Zhao, Y., Fraisse, E., Wang, Y. F., Petzold, L. (2022, November). VREN: volleyball rally dataset with expression notation language. In 2022 IEEE ICKG (pp. 337-346).
[24] Xu D., Fu H., Wu L., Jian M., Wang D., Liu X.: Group activity recognition by using effective multiple modality relation representation with temporal-spatial attention. IEEE Access 8, (2020)
[25] Yan R., Xie L., Tang J., Shu X., Tian Q.: Higcin: hierarchical graph-based cross inference network for group activity recognition. IEEE TPAMI (2020)
[26] Zhou H, Kadav A, Shamsian A, Geng S, Lai F, Zhao L, Liu T, Kapadia M, Graf HP: COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality. ECCV, 2022
[27] Donadello, I., Di Francescomarino, C., Maggi, F. M., Ricci, F., Shikhizada, A. Outcome-oriented prescriptive process monitoring based on temporal logic patterns. Engineering Applications of Artificial Intelligence (2023).
[28] Claudio Di Ciccio, Marco Montali: Declarative Process Specifications: Reasoning, Discovery, Monitoring. Process Mining Handbook 2022.
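To make the two templates named in Section 4.3.2 concrete, the sketch below checks Chain Response and Alternate Precedence on toy traces and computes a support value as the fraction of traces that satisfy a template. This is an illustration in plain Python, not RuM's implementation or API: the function names and the four example traces are hypothetical, and RuM's exact support definition may differ (for example, in how it treats traces where the template is never activated).

```python
from typing import Callable, Sequence

Trace = Sequence[str]


def chain_response(trace: Trace, a: str, b: str) -> bool:
    """Every occurrence of `a` is immediately followed by `b`."""
    for i, event in enumerate(trace):
        if event == a and (i + 1 >= len(trace) or trace[i + 1] != b):
            return False
    return True


def alternate_precedence(trace: Trace, a: str, b: str) -> bool:
    """Every `b` is preceded by an `a` with no other `b` in between."""
    a_seen = False  # True once `a` occurred since the last `b` (or the start)
    for event in trace:
        if event == a:
            a_seen = True
        elif event == b:
            if not a_seen:
                return False
            a_seen = False  # the next `b` needs a fresh `a`
    return True


def support(traces: Sequence[Trace],
            template: Callable[[Trace, str, str], bool],
            a: str, b: str) -> float:
    """Fraction of traces that satisfy the template (simplified definition)."""
    return sum(template(t, a, b) for t in traces) / len(traces)


# Hypothetical matches, each reduced to a chronological list of action labels.
matches = [
    ["reception", "set", "spike", "block"],
    ["set", "spike", "block", "set", "spike", "point"],
    ["spike", "spike", "block"],
    ["block", "spike"],
]

print(support(matches, chain_response, "spike", "block"))
print(support(matches, alternate_precedence, "spike", "block"))
```

On these toy traces, Chain Response(spike, block) holds only in the first match (support 0.25), while Alternate Precedence(spike, block) fails only in the last one, where a block occurs before any spike (support 0.75). A support threshold such as the 20% used above would keep both templates.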