Preserving and Annotating Dance Heritage Material through Deep Learning Tools: A Case Study on Rudolf Nureyev

Silvia Garzarella1,†, Lorenzo Stacchio2,∗,†, Pasquale Cascarano1, Allegra De Filippo3, Elena Cervellati1 and Gustavo Marfia1

1 Department of the Arts, University of Bologna, Italy
2 Department of Political Sciences, Communication and International Relations, University of Macerata, Italy
3 Department of Computer Science and Engineering, University of Bologna, Italy

International Workshop on Artificial Intelligence and Creativity (CREAI), co-located with ECAI 2024
∗ Corresponding author.
† These authors contributed equally.
silvia.garzarella3@unibo.it (S. Garzarella); lorenzo.stacchio@unimc.it (L. Stacchio); pasquale.cascarano2@unibo.it (P. Cascarano); allegra.defilippo@unibo.it (A. De Filippo); elena.cervellati@unibo.it (E. Cervellati); gustavo.marfia@unibo.it (G. Marfia)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
The cultural heritage of theatrical dance involves diverse sources that require complex multi-modal approaches. Since manual analysis methods are labor-intensive and therefore limited to few data samples, we here discuss the use of the DanXe framework, which combines different AI paradigms for comprehensive dance material analysis and visualization. However, DanXe lacks models and datasets specific to dance domains. To address this, we propose a human-in-the-loop (HITL) extension of DanXe that accelerates multi-modal data labeling through semi-automatic, high-quality annotation. This approach aims to create detailed datasets while providing humans with a set of user-friendly and effective tools for advancing multi-modal dance analysis and optimizing AI methodologies for dance heritage documentation. To date, we have designed a novel middleware that adapts the data generated by the visual Deep Learning (DL) models within DanXe to visual annotation tools, empowering domain experts with a user-friendly instrument to preserve all the components involved in a choreographic creation and enriching the process of metadata creation.

Keywords
Artificial Intelligence, Data Labeling, Deep Learning, Cultural Heritage, Dance

1. Introduction

The cultural heritage of theatrical dance consists of a multitude of sources, both tangible and intangible. These sources are diverse by nature and type, location, and preservation methods, creating a complex constellation that requires a diverse set of skills to be effectively enhanced [1]. Acknowledging this complexity is inherently tied to a comprehensive and integrated analysis of theory and practice, with significant implications in terms of accessibility [2]. Considering choreography in particular, while historiographical approaches are essential for working with written documentation, a thorough analysis requires an understanding that often involves observing execution techniques [3, 4, 5].

Due to this challenge, we have attempted to envision a framework that allows for an integrated approach to theatrical dance's documentary assets, using the artistic and cultural legacy of the dancer Rudolf Nureyev (1938-1993) as a case study [6]. The decision to focus on Nureyev stemmed from the distinctive nature of the documentary heritage associated with him.
He was one of the first dancers to experience extensive and varied media coverage, given the period during which his career developed (between the 1960s and the 1980s). This widespread mediatization, during a transformative era for both dance and media, underscores the unique and multifaceted nature of his legacy, making him a pivotal example for the dance domain. However, it is worth noticing that in this case, as in others like it, the large amount of available data (e.g., dance videos, playbills, or biographical documents) and its international distribution often lead dance experts to apply multi-modal analysis only to a limited number of samples [7]. Such approaches exhibit three main limits: (a) even for an expert, analyzing such data by hand is time-consuming; (b) the outcomes of such analyses are hard to organize and visualize effectively (e.g., to discover correlations); (c) they prevent the discovery of semantic knowledge that can only be found by applying a multi-modal analytical approach to a vast amount of data [8, 7, 9, 10].

To face all these challenges, Computational Dance (COMD) paradigms amount to a possible solution. However, COMD is underserved by comprehensive datasets, limiting the potential for in-depth research and development [7, 11, 12]. This lack is even greater when considering multi-modal dance datasets: the majority of existing datasets were collected for uni-modal analysis, in particular choreographic analysis [7, 11, 12]. Such datasets would be fundamental to optimizing AI methodologies capable of automatically extracting knowledge and labels from digital dance material [11]. Along this line, a recent work introduced a unified multi-modal analysis tool, DanXe, an Extended Artificial Intelligence framework that blends (i) AI algorithms for the digitization and automated analysis of both tangible and intangible materials, with the goal of crafting a digital replica of dance cultural heritage, and (ii) XR solutions for the immersive visualization of the derived insights. This framework introduces a novel space for the concurrent analysis of all the elements that define the essence of dance [12].

For the use case considered here, the AI analysis module of DanXe can be effectively used to extract knowledge from different kinds of dance heritage materials, since it employs several Deep Learning (DL)-based models to examine textual, audio, visual, and 3D data, providing a foundational framework for multi-modal dance analysis. However, this framework does not resort to models specifically designed for the dance arena or for domains other than choreography, again exhibiting a lack of models and datasets. Tools like DanXe can be employed to digitalize dance heritage and, at the same time, to accelerate the labeling of multi-modal dance data, which can then be used to train multi-modal models that improve heritage preservation and analysis. Nevertheless, the integration of human experts is required to ensure the quality of the generated data and to enrich it with novel, connected knowledge. For this reason, we here propose an extension of the DanXe framework that injects a human-in-the-loop (HITL) component leveraging the initial AI-inferred annotations as a foundation, enabling a semi-automatic approach that provides high-quality labels. This approach aims to facilitate the creation of richer and more accurate datasets to support the optimization of future heritage preservation models.
We here contextualize such an approach for a multi-modal dance data annotation process, considering the specific case of choreography, where there is a lack of datasets capturing fine-grained labels of specific dance moves, as existing ones often focus on the general style [13, 14, 11].

2. Materials and Methods

We here provide a detailed overview of the materials and methodologies employed in our study. We begin with the Video Dataset subsection, which describes the collection and characteristics of the video data used for the analysis. Following this, the AI Augmented Human Annotator subsection outlines the HITL approach that leverages an AI dance toolbox to enhance human annotation efficiency and accuracy. Finally, the Visual Annotation Tool Integration Middleware subsection discusses the middleware designed and implemented to seamlessly integrate the AI-generated annotations into a visual annotation tool, facilitating a cohesive and streamlined workflow for annotation domain experts. Each subsection aims to elucidate the components and techniques central to our research process.

2.1. Video dataset

The dataset was created using materials from the case study, which were originally recorded on film, distributed in cinemas and on VHS, and later digitized. The original recordings often suffered from wear and tear (e.g., film damage, darkening). Additionally, the original intended use, designed for cinema or home video viewing, included video direction elements such as close-ups, zooms, and fade-ins/outs. These elements often obscure movements and are not ideal for a comprehensive recording of the performance. The selection of a video for building the initial dataset was therefore inevitably driven by the need for well-lit footage, the highest possible definition, and minimal directorial interventions. To further reduce noise (e.g., background dancers, extras), we decided to analyze a solo performance: Nureyev's adagio of Prince Siegfried in Swan Lake (Act I). In the analysis of this adagio, we focused on the initial 20 seconds of choreography. Here, the dancer transitions from a static pose, embodying his character without movement, to a sequence of steps performed in place. These steps showcase a range of volumes and heights, adding depth and dimension to the performance.

2.2. AI Augmented Human Annotator

Considering our main use case, choreography-related data, various labels and pieces of information can be inferred, including music, dance styles, individual dance moves, and background descriptions. Some of this information can be inferred with a high degree of accuracy by modern DL approaches, such as the ones introduced in the DanXe platform [12]. Despite this rich potential, it remains challenging for human experts to adjust, enhance, and integrate novel labels or information clearly and visually on top of this generated data. Along this line, visual annotation tools (VATs) can be exploited [15]. In fact, the primary advantage of VATs is their ability to significantly reduce the manual effort required from users, even non-expert ones. By incorporating functionalities for manual, semi-automatic, and automatic annotation through advanced AI algorithms, VATs can accelerate high-quality data labeling [15], also thanks to the combined quantitative and qualitative approach that such a process naturally introduces [16].
Given these considerations, we employed the DanXe visual annotation module as a black box capable of inferring different data relevant to the dance visual domain, such as textual data within pictures, human pose estimation, and semantic segmentation, and we defined a novel visual-annotation-based framework on top of it. This is visually represented in Figure 1.

Figure 1: HITL Augmented Semi-automatic Annotation Architecture.

This annotation layer assumes that all synthesized AI label data are stored in a local database after their inference. A data adaptation middleware ingests and transforms the various data formats inferred by the different AI models, ensuring compatibility with the visual annotation tool at hand. This setup enables human annotators to use the tool to correct and add new labels on top of the existing information. Subsequently, the updated annotations are re-adapted and stored in the database, following the inverse chain of processes. This iterative approach facilitates the efficient enhancement of dance video annotations, leveraging both AI and human expertise. To implement such a framework, a fundamental step amounts to defining a smart middleware able to bridge the different file formats and data structures produced by AI models and to make them interpretable by different visual annotation tools. For this reason, in the following, we describe the general architecture of the middleware we defined to ingest annotations coming from different AI tools and adapt them to visual annotation tools.

2.3. Visual Annotation Tool Integration Middleware

In response to the growing complexity of data formats and of their interpretation across different tasks (e.g., Human Pose Estimation), we developed a middleware solution to foster interoperability between diverse AI models and various visual annotation tools. This middleware serves as a bridge, facilitating seamless communication and data exchange between the different components of the annotation pipeline. Its architecture is reported in Figure 2.

Figure 2: Middleware Architecture.

By implementing standardized interfaces and protocols, the middleware enables the integration of multiple deep learning models, each specializing in a different aspect of visual analysis, such as pose estimation or object detection. Simultaneously, it converts the ingested data to match the interfaces of a range of visual annotation tools, providing a unified platform for annotators to interact with and refine the output of these models. Through this interoperability, the annotation process is augmented, offering annotators the flexibility to leverage the strengths of different models while benefiting from user-friendly interfaces. Moreover, by automating certain aspects of annotation and providing semi-automatic functionalities, the middleware accelerates the annotation workflow, significantly reducing the time and effort required to generate high-quality annotated datasets.
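To make this architecture more concrete, the following Python sketch shows one possible way to organize the middleware as a registry of format adapters, one per (AI model, annotation tool) pair. The names FormatAdapter, MiddlewareRegistry, to_tool, and to_storage are hypothetical and introduced here only for illustration; they do not reflect the actual DanXe implementation.

from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple


class FormatAdapter(ABC):
    """Converts between one AI model's output format and one annotation tool's input format."""

    @abstractmethod
    def to_tool(self, model_output: Dict[str, Any]) -> Dict[str, Any]:
        """Adapt raw model output so the visual annotation tool can load it."""

    @abstractmethod
    def to_storage(self, tool_output: Dict[str, Any]) -> Dict[str, Any]:
        """Re-adapt the (possibly human-corrected) annotations for the local database."""


class MiddlewareRegistry:
    """Keeps one adapter per (model, tool) pair and dispatches conversions to it."""

    def __init__(self) -> None:
        self._adapters: Dict[Tuple[str, str], FormatAdapter] = {}

    def register(self, model: str, tool: str, adapter: FormatAdapter) -> None:
        self._adapters[(model, tool)] = adapter

    def adapt(self, model: str, tool: str, model_output: Dict[str, Any]) -> Dict[str, Any]:
        # Forward path: database -> annotation tool.
        return self._adapters[(model, tool)].to_tool(model_output)

    def store(self, model: str, tool: str, tool_output: Dict[str, Any]) -> Dict[str, Any]:
        # Inverse path: annotation tool -> database.
        return self._adapters[(model, tool)].to_storage(tool_output)

Under this scheme, supporting an additional AI model or annotation tool amounts to registering one more adapter, without modifying the rest of the pipeline.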
3. Results

We concretely applied the introduced methodology to accelerate the annotation of single dance moves from a multi-modal perspective (i.e., linking human pose estimation and single dance moves). To the best of our knowledge, this is the first attempt to do so through a custom-defined middleware and a semi-automatic approach. In particular, we took as a use case choreographic human pose estimation, using the AlphaPose models (https://github.com/MVIG-SJTU/AlphaPose) included in the DanXe pipeline. AlphaPose allows us to extract and track multi-person poses, codified into 17 body key points when the model trained on the COCO dataset is used [17]. In our case, we applied it to a variation from Swan Lake performed by Rudolf Nureyev in 1967. The human pose estimation extracted by AlphaPose is stored in a JSON file containing one record per frame in which a person was detected. Some visual representations of the inferred key points are reported in Figure 3, while an example of the resulting JSON file is provided in Listing 1.

Figure 3: Visual Keypoints inferred by AlphaPose.

Listing 1: Human pose estimation JSON data generated by AlphaPose on a single image.

{ "image_id": "0.jpg",
  "category_id": 1,
  "keypoints": [
    311.9952087402344, 307.96734619140625,  // nose
    314.58056640625, 305.3819580078125,     // right eye
    311.9952087402344, 304.08929443359375,  // left eye
    322.336669921875, 305.3819580078125,    // right ear
    309.40985107421875, 305.3819580078125,  // left ear
    330.0927734375, 322.1868591308594,      // right shoulder
    306.8244934082031, 322.1868591308594,   // left shoulder
    333.9708251953125, 340.2843933105469,   // right elbow
    299.0683898925781, 338.9917297363281,   // left elbow
    327.5074157714844, 357.0892639160156,   // right wrist
    286.1415710449219, 350.6258544921875,   // left wrist
    322.336669921875, 357.0892639160156,    // right hip
    310.7025451660156, 355.7966003417969,   // left hip
    323.6293640136719, 386.82098388671875,  // right knee
    314.58056640625, 386.82098388671875,    // left knee
    323.6293640136719, 415.260009765625,    // right ankle
    318.4586181640625, 411.3819580078125    // left ankle
  ],
  "score": 3.0010504722595215,
  ... }

In this example, a single person was detected within the considered frame (category ID 1). The key points array contains the x and y coordinates of each body joint, exemplified by the first key point, the nose, positioned at (311.995, 307.967); the confidence score that AlphaPose provides for each key point is omitted here for simplicity. As mentioned, there are 17 key points, corresponding to the nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles. Each key point serves as a precise indicator of a specific body part's location within the image frame. The overall confidence in the pose estimation is quantified by a score of 3.001. Information related to the bounding box enclosing the detected human figure is also inferred but is not reported, again for simplicity.

Starting from this representation, we then adapted it for our target visual annotation tool, Vidat (https://github.com/anucvml/vidat). Vidat is a high-quality video annotation tool for computer vision and machine learning applications that is simple and efficient to use for a non-expert and supports multiple annotation types, including temporal segments, object bounding boxes, semantic and instance regions, and human pose (skeleton). Moreover, it is completely data-driven: all the data can be stored and loaded by encoding them in a predefined key-value structure (i.e., a JSON file). Our goal was to load the annotated data from AlphaPose in a format readable by Vidat. However, the Vidat skeleton structure description does not take into account the elbow data. This means that we first filtered out the elbow data for each detection and then re-adapted the remaining information to match the reading structure of the Vidat tool.
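As an illustration of this adaptation step, the following Python sketch (a minimal, hypothetical example, not the actual middleware code) converts a list of AlphaPose detections into a Vidat-like skeleton map: it drops the elbow key points, discards the per-key-point confidence scores that AlphaPose stores alongside each (x, y) pair, and groups the remaining points by frame. The key-point ordering follows the comments of Listing 1, while the id assignment and the centroid-based center computation are illustrative choices.

import json
from typing import Dict, List

# Key point names in the order shown in Listing 1 (COCO body model).
ALPHAPOSE_KEYPOINTS = [
    "nose", "right eye", "left eye", "right ear", "left ear",
    "right shoulder", "left shoulder", "right elbow", "left elbow",
    "right wrist", "left wrist", "right hip", "left hip",
    "right knee", "left knee", "right ankle", "left ankle",
]

# Joints not represented in the Vidat skeleton description.
EXCLUDED = {"right elbow", "left elbow"}


def alphapose_to_vidat(detections: List[dict]) -> dict:
    """Convert AlphaPose per-frame detections into a Vidat-like skeleton map."""
    skeleton_map: Dict[str, list] = {}
    for det in detections:
        frame = det["image_id"].split(".")[0]   # e.g. "0.jpg" -> "0"
        kps = det["keypoints"]                  # flat list of (x, y, score) triplets
        points = []
        for i, name in enumerate(ALPHAPOSE_KEYPOINTS):
            if name in EXCLUDED:
                continue
            x, y = kps[3 * i], kps[3 * i + 1]   # the per-key-point score is dropped
            # Ids simply follow insertion order here; the actual Vidat skeleton
            # type defines its own id scheme (cf. Listing 2).
            points.append({"id": len(points), "name": name,
                           "x": round(x), "y": round(y)})
        skeleton = {
            "pointList": points,
            # Skeleton center, here approximated by the centroid of the kept points.
            "centerX": round(sum(p["x"] for p in points) / len(points), 2),
            "centerY": round(sum(p["y"] for p in points) / len(points), 2),
        }
        skeleton_map.setdefault(frame, []).append(skeleton)
    # The full Vidat file also carries video metadata and a "config" section.
    return {"skeletonAnnotationListMap": skeleton_map}


if __name__ == "__main__":
    # The input file name is illustrative; it should point to the AlphaPose JSON output.
    with open("alphapose-results.json") as f:
        vidat_like = alphapose_to_vidat(json.load(f))
    print(json.dumps(vidat_like, indent=2)[:400])

The JSON produced by the actual middleware is reported in Listing 2.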
Listing 2: JSON representation of video annotations and configurations.

{ ...,
  "objectAnnotationListMap": {},
  "regionAnnotationListMap": {},
  "actionAnnotationList": [],
  "skeletonAnnotationListMap": {
    "0": [{ ...,
      "pointList": [
        { "id": 0, "name": "nose", "x": 312.0, "y": 308.0 },
        { "id": 1, "name": "left eye", "x": 312.0, "y": 304.0 },
        { "id": 2, "name": "right eye", "x": 315.0, "y": 305.0 },
        ...
      ],
      "centerX": 315.67, "centerY": 346.13 }
    ]
  },
  "config": {
    "objectLabelData": [...],
    "actionLabelData": [...],
    "skeletonTypeData": [...]
  }
}

The provided JSON encapsulates the metadata needed for video annotation and analysis. Within its structure, key parameters such as video dimensions, frame rate, and duration are outlined, which are essential for Vidat's temporal analysis and processing (not reported in the example for simplicity). The inclusion of keyframe listings offers strategic markers for video segmentation and analysis, facilitating efficient data handling. Furthermore, the presence of object and region annotation maps anticipates a future expansion towards object detection and spatial characterization. The action annotation list underscores the intention to annotate dynamic data. Particularly noteworthy is the skeleton annotation list, which furnishes detailed skeletal representations. The configuration segment provides an extensive catalog of object and action label data, coupled with skeleton-type specifications, forming the cornerstone for semantic understanding and classification of the video content.

Finally, since this JSON is aligned with the original video frames, it can be loaded into the Vidat visual annotation tool. Our dance domain expert used the inferred human key-point labels to add new dance move labels, supported by the already-generated dance poses. Each label corresponds to a name (e.g., arabesque) and a time interval representing the duration of the step execution. After completing the label descriptions, the next step would normally amount to labeling human movement frame by frame, but those labels were already generated automatically, so the domain expert only corrected minor interpolations or mismatches between the skeleton and the video image. Finally, the dance domain expert annotated dance moves linked with one or more inferred poses. The resulting JSON can be stored at any moment and then includes both skeleton and dance move label data. The outputs of this process are visually reported in Figure 4.

Figure 4: Actions labeled with Vidat.

4. Discussion and Conclusion

The introduction of the DanXe framework represents a significant leap forward in digitizing and analyzing dance heritage materials, offering promising capabilities for the automatic annotation of archive videos. Supported by human oversight and augmented by XR technologies, the proposed multi-modal, semi-automatic annotation framework signifies a substantial advancement in cultural heritage conservation, especially in cases involving intangible heritage alongside tangible assets. Given the unique nature of the analyzed case study (that of an archival collection related to a dancer's legacy), the annotations cannot be limited to just recognizing steps but must also allow for tracking props, stage settings, and performers involved.
This would enable the preservation of all the scenic components that contributed to a choreographic creation, ensuring better preservation, facilitating restaging processes, and enriching the process of metadata creation, which is typically limited to the principal performers or even just the choreographer. Providing a tool that can support the work of scholars and archivists, without replacing their expertise but leveraging it to validate semi-automatic acquisitions, not only represents a valuable contribution in expediting their work but also enriches the metadata associated with archival sources, thus enabling user research. This approach promises to generate richer, more accurate datasets, ultimately fostering a deeper understanding and appreciation of the art form.

Acknowledgments

This work was partly funded by: (i) the PNRR - M4C2 - Investimento 1.3, Partenariato Esteso PE00000013 - "FAIR - Future Artificial Intelligence Research" - Spoke 8 "Pervasive AI", funded by the European Commission under the NextGeneration EU program.

References

[1] J. Adshead-Lansdale, J. Layson, Dance history: An introduction, Routledge, 2006.
[2] M. De Marinis, Il corpo dello spettatore. Performance studies e nuova teatrologia, Sezione di Lettere (2014) 188–201.
[3] E. Giannasca, Dance in the ontological perspective of a document theory of art, Danza e ricerca. Laboratorio di studi, scritture, visioni 10 (2018) 325–346.
[4] E. Randi, Primi appunti per un progetto di edizione critica coreica, SigMa-Rivista di Letterature comparate, Teatro e Arti dello spettacolo 4 (2020) 755–771.
[5] S. Franco, Corpo-archivio: mappatura di una nozione tra incorporazione e pratica coreografica (2019).
[6] J. Kavanagh, Rudolf Nureyev: the life, Penguin UK, 2013.
[7] K. El Raheb, Y. Ioannidis, Dance in the world of data and objects, in: International Conference on Information Technologies for Performing Arts, Media Access, and Entertainment, Springer, 2013, pp. 192–204.
[8] L. A. Naveda, M. Leman, Representation of samba dance gestures, using a multi-modal analysis approach, in: Enactive08, Edizioni ETS, 2008, pp. 68–74.
[9] N. Li, Q. Shen, R. Song, Y. Chi, H. Xu, Medukg: a deep-learning-based approach for multi-modal educational knowledge graph construction, Information 13 (2022) 91.
[10] L. Church, N. Rothwell, M. Downie, S. DeLahunta, A. F. Blackwell, Sketching by programming in the choreographic language agent, in: PPIG, 2012, p. 16.
[11] R. Li, J. Zhao, Y. Zhang, M. Su, Z. Ren, H. Zhang, Y. Tang, X. Li, Finedance: A fine-grained choreography dataset for 3D full body dance generation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10234–10243.
[12] L. Stacchio, S. Garzarella, P. Cascarano, A. De Filippo, E. Cervellati, G. Marfia, DanXe: an extended artificial intelligence framework to analyze and promote dance heritage, Digital Applications in Archaeology and Cultural Heritage (2024) e00343.
[13] O. Alemi, J. Françoise, P. Pasquier, GrooveNet: Real-time music-driven dance movement generation using artificial neural networks, Networks 8 (2017) 26.
[14] T. Tang, J. Jia, H. Mao, Dance with melody: An LSTM-autoencoder approach to music-oriented dance synthesis, in: Proceedings of the 26th ACM International Conference on Multimedia, 2018, pp. 1598–1606.
[15] S. Bianco, G. Ciocca, P. Napoletano, R. Schettini, An interactive tool for manual, semi-automatic and automatic video annotation, Computer Vision and Image Understanding 131 (2015) 88–99.
[16] L. Stacchio, A. Angeli, G. Lisanti, G. Marfia, Applying deep learning approaches to mixed quantitative-qualitative analyses, in: Proceedings of the 2022 ACM Conference on Information Technology for Social Good, 2022, pp. 161–166.
[17] H.-S. Fang, J. Li, H. Tang, C. Xu, H. Zhu, Y. Xiu, Y.-L. Li, C. Lu, AlphaPose: Whole-body regional multi-person pose estimation and tracking in real-time, IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).