
Experience Augmentation in Physical Therapy by Simulating Patient-Specific Walking Motions

Md Mustafizur Rahman

rahman.md_mustafizur.rp6@naist.ac.jp 1

Goshiro Yamamoto

goshiro@kuhp.kyoto-u.ac.jp 0 1

Chang Liu

liuchang@kuhp.kyoto-u.ac.jp 0

Hiroaki Ueshima

h_ueshima@kuhp.kyoto-u.ac.jp 0

Isidro Butaslac

isidro.b@naist.ac.jp 1

Taishi Sawabe

t.sawabe@is.naist.jp 1

Hirokazu Kato

kato@is.naist.jp 1

0 Kyoto University, 54 Kawahara-cho, Shogoin, Sakyo-ku, Kyoto 606-8507, JAPAN

1 Nara Institute of Science and Technology (NAIST), 8916-5 Takayama-cho, Ikoma, Nara 630-0192, JAPAN

2024


In physical therapy, understanding and analyzing patient movements, especially impaired gait patterns, is crucial for effective rehabilitation. Traditionally, trainee therapists acquire these skills through hands-on experience with real patients and textbooks. However, these methods are limited by the availability of patients and the variability of impaired motions that therapists can observe. To address these limitations, we propose a novel system that allows therapists to learn from a wide range of impaired gait motions without being restricted by time, place, or patient availability. This system utilizes the HumanML3D dataset and a two-step framework combining text2length sampling and text2motion generation. In the first step, a classification model predicts motion length based on the input textual descriptions. For the second step, we use a temporal variational autoencoder (VAE) for generating varied and consistent 3D motion sequences. A key component of our approach is the utilization of residual vector quantization (RVQ) from the MoMask framework, which minimizes errors and enhances the precision of motion generation. Furthermore, a Masked Transformer ensures that the synthesized motion tokens are temporally consistent and contextually accurate. Our system, validated through the HumanML3D dataset, provides an immersive and interactive tool for physical therapists, enabling dynamic, patient-specific motion simulations in mixed reality environments. By bridging the gap between conventional methods and MR-assisted training, this approach uses interactive 3D representations to transform how therapists learn. It aims to revolutionize therapeutic training, making rehabilitation strategies more effective and personalized.

Keywords: Physical Therapists, Motion Generation, Therapeutic Training, Rehabilitation, Mixed Reality

Figure 1: Conventional training versus ours (panels A to C). In conventional training, the therapist observes the gait of a real impaired patient in a real environment. In our approach, a therapist with an HMD observes the generated gait of a virtual impaired patient in an MR environment; one panel shows the therapist's HMD view. The therapist can repeatedly experience observing the virtual patient in front of them by looking through the HMD.

1. Introduction

One of the main goals of stroke rehabilitation programs is the recovery of gait, which is often an important goal for patients as well. Post-stroke functional recovery typically involves both natural processes and therapeutic interventions. While the majority of stroke survivors regain the ability to walk, many fail to achieve sufficient endurance, speed, or stability to perform daily activities independently and safely. After a stroke, falls are still a major problem for people who live in the community [1].

In conventional therapy training, trainee therapists rely primarily on textbooks and hands-on experience with real patients to understand impaired gait motion. More recent studies used mixed reality (MR) to develop the therapist's ‘clinical eye,’ enhancing their assessment through overlaid visualizations of patient data during rehabilitation [2, 3]. However, a limitation of this method is that the availability of patients restricts the variety and frequency of learning opportunities, limiting exposure to different types of impairments and hindering skill development. Building upon our prior work on a virtual reality (VR)-based medical training simulator [4], which demonstrated the efficacy of immersive 3D modeling and robotic systems in enhancing medical training and remote surgery, this study extends these principles to augment therapeutic learning with patient-specific motion simulations in MR. In contrast to traditional methodologies, this research investigates how the incorporation of MR-based simulations of impaired gait motion can improve learning outcomes for therapists undergoing training by providing increased exposure to a variety of gait impairments.

As illustrated in Fig. 1, when provided with the input description, “a man walks forward with a noticeable limp due to pain, favoring his right leg as he moves, his steps are uneven, and his body tilts slightly with each step, reflecting discomfort,” our system generates multiple unique three-dimensional (3D) impaired human motions that closely correspond to the given textual input. This approach significantly enhances traditional training methods by offering immersive, repeatable learning experiences, leading to improved diagnostic accuracy and therapeutic outcomes. The system aims to faithfully replicate a wide range of realistic 3D human motion dynamics that precisely adhere to the specified directions, actions, timing, speed, and style described in the text.

Applications in robotics, human-machine interface, and virtual content creation, among others, could be greatly impacted by this automated process. Alternative approaches such as motion capture are costly and time-consuming, so automatic text-to-motion generation is more feasible and cost-effective. Despite this, such a task is quite difficult because words and motion data are heterogeneous in many aspects.

With this, a number of attempts have been made in recent years, such as the use of an encoder with recurrent neural networks (RNNs) [5], variational autoencoders (VAE) [6], and transformer networks aiming to embed the language and motion in the same space, converting them into a unified approach [7, 8]. Although these methods have proved effective with small units of text, the downside is that longer text conveying complex ideas does not produce good sequences of motion. Moreover, while existing diffusion processes have shown effectiveness for image generation and motion generation from text descriptions [9, 10], it remains unclear whether such improvements within one architecture come at a reasonable cost compared to more traditional Vector Quantized Variational Autoencoder (VQ-VAE) based approaches.

In this work, we leverage the MoMask method introduced by Guo et al. [7], which combines hierarchical quantization with generative transformer models to address the limitations of previous techniques. While traditional methods like Residual Vector Quantization (RVQ) [6] attempt to reduce quantization errors by embedding motion tokens multiple times, MoMask offers a more advanced solution. As the first generative masked modeling framework for text-to-motion generation, MoMask features a hierarchical quantization generative model and a dedicated mechanism for precise residual quantization, base token generation, and residual token prediction. Additionally, we integrate the HumanML3D [11] dataset, which contains 14,616 annotated motion clips and 44,970 text descriptions, providing a comprehensive resource for generating and evaluating human motions. To facilitate seamless impaired humanoid motion retargeting, we develop a headless Blender Python API script that enables mapping between different humanoid rigs and allows for local saving of bone mappings. Moreover, we implement a FastAPI backend that allows users to stream data directly from Unity3D and use them for real-time humanoid animation visualization with an HMD, ensuring smooth integration and display.

2. Related Work

Existing work relating to our research falls mostly into the domains of (2.1) 3D human motion generation, (2.2) text-motion generation, and (2.3) language models and human motion captioning.

2.1. 3D Human Motion Generation

Significant advancements have been made in 3D human motion generation, utilizing various approaches that leverage action learning, audio, and text inputs. Traditional methods often employ a hidden state vector to generate sequential states. Basic approaches, such as those by Cai et al. [12] and Wang et al. [13], utilized GAN algorithms to extend partial sequences with newly generated states. In contrast, more advanced methods like Yu et al. [14] employed GCN to capture the spatial and temporal dynamics of human motion. Furthermore, VAE and transformer-based models have been applied to better capture temporal dependencies, as demonstrated by Guo et al. [15, 16] and Petrovich et al. [17]. For audio-driven motion generation, techniques often transform acoustic features into human poses. Studies such as Takeuchi et al. [18] utilized two-way LSTMs to generate gestures from speech, while Shlizerman et al. [19] and Tang et al. [20] investigated song and dance motion generation. Recent models, such as Lee et al. [21], also focused on the stochastic aspects of movement, which introduced uncertainty in dance movements.

2.2. Text-motion Generation

Text-motion generation has become increasingly popular due to the ease of using natural language input. Previous studies [22, 23, 24, 25, 26] used mainly deterministic models, which typically average or blur the motion output. More recent stochastic models, such as those in T2M [27] and TEMOS [28], introduced more realism and variety into motion generation by using VAE structures and transformers to provide the shared transition between speech and motion [29, 30, 31, 32, 33, 34]. Recent innovations, such as autoregressive models [35, 36, 8, 37, 38], have substantially increased the quality of motion synthesis through denoising or motion suspension. Generative masked modeling inspired by BERT [39] has also been developed for human motion generation, using techniques such as residual quantization [40, 41, 42] to improve motion discretization and reduce quantization errors.

2.3. Language Models and Human Motion Captioning

The translation from natural language to human motion has evolved from mathematical models [43] to advanced neural networks like TM2T [36], which provides two-way visualization between text and movement. Major language models such as BERT [39], T5 [44], and InstructGPT [45] have pushed the boundaries of understanding across sectors. In multimodal learning, models like CLIP [46] have linked images with text, inspiring similar advancements in human motion tasks, such as MotionCLIP [47]. Despite this progress, language models still remain underutilized in human motion tasks. Our research seeks to integrate them into motion generation, leveraging pretrained models to create diverse motions. Moreover, while existing work predominantly focuses on generating normal human motions, our system specifically targets the generation of impaired motions crucial for physical therapy training and rehabilitation.

3. Method

3.1. System Overview

The proposed system combines text-to-motion generation, motion retargeting, and 3D animation export to create realistic human motion sequences from textual descriptions. As illustrated in Fig. 2, it features a backend powered by a Python server using FastAPI and a frontend in Unity3D, allowing therapists to interact with animations via a Meta Quest 3 headset. The backend processes input prompts with MoMask model checkpoints to generate patient-specific impaired motions, saved in ‘.npy’ or ‘.bvh’ format.

Figure 2 (labels recovered from the figure): API endpoints /gen_text2motion and /download_fbx. Pipeline stages: generate text-to-motion (.npy/.bvh format) using pre-trained model checkpoints by MoMask; retargeting (transfer animation from source rig to destination rig); generate the FBX animations.

In general, our system comprises the following core components:

Backend: The backend features a FastAPI-based Python server responsible for processing the motion generation pipeline. It utilizes pre-trained models from the MoMask framework, which transform therapist input prompts into impaired motion representations. This includes generating motion sequences that reflect specific conditions or impairments, ensuring realistic and applicable outputs for therapeutic use.

Frontend: The frontend is built on the Unity3D engine, which is employed to visualize the generated animations. It is designed to interface seamlessly with the Meta Quest 3 headset, enabling immersive interaction for therapists. This integration allows users to experience the animations in a three-dimensional space, providing a more intuitive understanding of the impaired motions.

API Endpoints: The system utilizes two key API endpoints to manage communication between the frontend and backend. The first endpoint, “/gen_text2motion”, takes the therapist's input prompt and triggers the motion generation process. The backend processes the prompt through the MoMask model, which translates the text description into a motion representation in formats like ‘.npy’ or ‘.bvh’. Once the motion is generated and retargeted, the second endpoint, “/download_fbx”, allows the frontend to retrieve the final FBX animation file. This file is then used to visualize the impaired motions in the MR interface. These API endpoints ensure smooth and efficient interaction between the components, allowing the system to generate and deliver animations in real-time based on simple text input, thereby enhancing the rehabilitation experience for therapists.

Figure 3: Therapist with HMD. Our system generates the impaired motion region based on the input description.

3.2. Text-to-Motion Generation

Our system builds upon the state-of-the-art techniques for text-driven motion generation, particularly drawing inspiration from the MoMask framework. The text-to-motion process is detailed below:

Tokenization of Motion Sequences: The textual descriptions are transformed into a sequence of discrete motion tokens using a vector quantization process. This process tokenizes complex human motion into a hierarchical structure of motion segments, each capturing different facets of the described action.

Masked Motion Prediction: A Masked Transformer is employed to predict masked motion tokens conditioned on the input text. During the training phase, the model is trained to fill in randomly masked tokens from incomplete motion sequences. In the inference phase, it generates entire motion sequences by iteratively predicting missing tokens, ensuring global consistency and fidelity to the input description.

Residual Refinement: After the base-layer motion is generated, a Residual Transformer is used to progressively refine the motion by predicting additional motion tokens that capture higher-order details. This step is crucial for enhancing the granularity and subtlety of the generated motion, ensuring fine control over aspects such as posture and movement transitions.

Motion Generation Output: The final output is a continuous 3D human motion sequence generated in ‘.npy’ or ‘.bvh’ formats. These motion sequences represent high-quality, realistic animations that can be further processed or directly visualized.
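The residual quantization idea behind the tokenization and refinement steps can be illustrated with a minimal, self-contained sketch. It is not the MoMask implementation: the codebooks here are random stand-ins for learned ones, the function names (`rvq_encode`, `rvq_decode`, `make_codebooks`) are hypothetical, and the zero code placed in every codebook is an illustrative choice that guarantees an extra layer never increases the reconstruction error.

```python
import random


def nearest(codebook, vec):
    """Return (index, code) of the codebook entry closest to vec (squared L2)."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )
    return best, codebook[best]


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the residual
    left over by the previous layers. Returns one token per layer."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx, code = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, code)]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct a vector by summing the selected code of every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def sq_error(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def make_codebooks(num_layers=4, size=16, dim=4, seed=0):
    """Random stand-in codebooks (assumption: real ones are learned).
    Deeper layers use a smaller scale because residuals shrink."""
    rng = random.Random(seed)
    books = []
    for layer in range(num_layers):
        scale = 0.5 ** layer
        book = [[rng.uniform(-scale, scale) for _ in range(dim)]
                for _ in range(size)]
        book[0] = [0.0] * dim  # zero code: a layer may leave the residual as is
        books.append(book)
    return books


if __name__ == "__main__":
    books = make_codebooks()
    pose_feature = [0.3, -0.7, 0.1, 0.5]  # toy 4-D feature, not real motion data
    tokens = rvq_encode(pose_feature, books)
    for k in range(len(books) + 1):
        approx = rvq_decode(tokens[:k], books)
        print(k, "layers -> squared error", round(sq_error(pose_feature, approx), 4))
```

In this reading, the first token of each frame plays the role of the base-layer token predicted by the Masked Transformer, and the remaining tokens correspond to the finer details that the Residual Transformer adds.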

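The request flow through the two endpoints named in Section 3.1 can be sketched from the client side. The paper's frontend is Unity3D, so this Python version is only illustrative; the server address, the HTTP methods, the JSON field name `prompt`, and the helper names are all assumptions that the text does not specify.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # hypothetical server address


def build_generation_request(prompt, base_url=BASE_URL):
    """Build a request for /gen_text2motion.
    Assumption: the endpoint accepts POST with a JSON body {"prompt": ...}."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/gen_text2motion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_download_request(base_url=BASE_URL):
    """Build a request for /download_fbx (assumed to be a plain GET)."""
    return urllib.request.Request(url=f"{base_url}/download_fbx", method="GET")


def fetch_fbx(prompt, out_path="motion.fbx", base_url=BASE_URL):
    """Trigger generation, then save the retargeted FBX file locally.
    Needs a running backend; nothing is sent until this function is called."""
    with urllib.request.urlopen(build_generation_request(prompt, base_url)) as resp:
        resp.read()  # wait for generation and retargeting to finish
    with urllib.request.urlopen(build_download_request(base_url)) as resp:
        with open(out_path, "wb") as fbx_file:
            fbx_file.write(resp.read())
    return out_path
```

A call such as `fetch_fbx("a man walks forward with a noticeable limp")` would then mirror the two-step sequence described above: generate first, download second.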
3.3. Motion Retargeting and Animation Transfer

After generating the motion sequences, the system applies a motion retargeting process to map the generated motion onto the target 3D model's skeleton. This is done using ‘keemap.rig.transfer’, a precision retargeting tool within Blender, which ensures accurate bone mapping and preserves key motion attributes such as foot positioning and overall balance. By retargeting the motion, we ensure that the animations are properly transferred to any humanoid rig, allowing seamless integration into various 3D models.

3.4. Export and Visualization in Unity

Once the motion has been retargeted, it is exported in FBX format, which contains both the 3D model and its associated animation. The FBX file is then imported into Unity 3D, where additional adjustments to model positioning and animation playback can be made. A custom Unity frontend was developed to provide an intuitive interface where users, such as physical therapists, can interact with and visualize the generated motion sequences in real-time, enhancing their ability to analyze and evaluate generated impaired motions.

In Fig. 3, the interactive process of generating impaired motion regions based on the therapist's input description is demonstrated. The sequence of images illustrates a physical therapist utilizing a head-mounted display to observe the virtual patient's motion in real-time. The leftmost frame captures the therapist's perspective, showcasing the virtual patient within a simulated environment. The subsequent frames display the progression of the patient's movement, with impaired regions clearly highlighted to indicate abnormalities in motion. This system enables therapists to interact with and analyze the impaired motion sequences in real-time, providing an immersive, hands-on approach that significantly enhances both diagnostic precision and therapeutic decision making.

4. Pilot Experiment

In the pilot experiment, we assessed the ability of our text-to-motion model to generate impaired gait motions. Fig. 4 highlights several outcomes from the experiment, showcasing both well-matched and mismatched animations.

Figure 4: Generated motions for three input descriptions. (A) “A man walks forward with a noticeable limp due to pain, favoring his right leg as he moves, his steps are uneven, and his body tilts slightly with each step, reflecting discomfort.” The motion matches the text description, and the avatar's body movement is also good. (B) “A person stands up from the ground, walks in a clockwise circle, and then sits back on the ground.” The motion matches the text description but both feet are bent. (C) “A man walks forward, they swing their left leg outward, causing their body to lean slightly to the right, after the left foot touches the ground, the right leg smoothly swings forward.” The motion doesn't match the text description because the avatar's body is straight.

In well-matched animations such as Fig. 4(A), the motion aligned well with the text description. The avatar showed a noticeable limp and discomfort, with uneven steps and a tilted posture, accurately portraying the described impairment. In some animations, such as Fig. 4(B), the generated motion mostly matched the description, but the avatar's bent feet introduced a small inconsistency, detracting from realism. In mismatched animations, taking Fig. 4(C) as an example, although the left leg's outward movement is captured, the avatar's body remained too straight, failing to show the expected slight lean.

The integration of the HumanML3D dataset, paired with pre-trained model checkpoints from MoMask, greatly improved the quality of the generated motions. These findings highlight the effectiveness of text-to-motion generation techniques. Additionally, the combination of Residual Vector Quantization-VAE (RVQ-VAE) and Transformer models contributed to the model's ability to capture both coarse-grained and fine-grained motion details, further enhancing the fidelity and accuracy of the animations.

5. Future Work

Looking ahead, we plan to broaden our research by developing a custom dataset composed of textual descriptions extracted from Electronic Health Records (EHRs). This domain-specific dataset will enable the model to generate motions that are more relevant to medical and therapeutic applications. Once this dataset is constructed, we will retrain our model to improve its performance in these specialized contexts.

Currently, we have not tested the effectiveness of the virtual motions in MR environments, which is another key feature of our system. In the future, we also intend to conduct more extensive user studies with a large cohort of therapists to evaluate the long-term impact of using our MR system on their training outcomes. Furthermore, we will incorporate user feedback to refine the user interface and enhance the system's overall functionality.

Finally, we aim to develop additional features within the MR environment, such as the simulation of real-world therapeutic scenarios and the ability to track the therapists' performance over time. These advancements will ensure that our system continues to be a valuable tool for therapist training and ultimately contributes to improved patient care.

6. Conclusion

This study has introduced a system for enhancing physical therapy training through MR simulations of patient-specific walking motions. By utilizing the HumanML3D dataset and advanced techniques like RVQ and Masked Transformers, the system generates realistic impaired gait patterns from textual descriptions. The system aims to provide therapists with immersive and repeatable training experiences, leading to improved diagnostic accuracy and therapeutic outcomes.

Future work will focus on developing a dataset from EHR and conducting user studies to assess the system's effectiveness in therapist training. Overall, our system has the potential to significantly improve how therapists learn and analyze gait impairments.

Acknowledgments

This work was supported by JSPS KAKENHI Grant Number JP23K24888.

References

[1] J.-M. Belda-Lois, S. Mena-del Horno, I. Bermejo-Bosch, J. C. Moreno, J. L. Pons, Farina, Iosa, Molinari, Tamburella, Ramos, et al., Rehabilitation of gait after stroke: a review towards a top-down approach, Journal of neuroengineering and rehabilitation 8 (2011), pp. 1–20.

[2] M. De Cecco, A. Luchetti, I. Butaslac, F. Pilla, G. M. A. Guandalini, J. Bonavita, M. Mazzucato, K. Hirokazu, Sharing augmented reality between a patient and a clinician for assessment and rehabilitation in daily living activities, Information 14 (2023), Art. No. 204.

[3] Luchetti, I. Butaslac, Rosi, Fruet, G. Nollo, P. G. Ianes, Pilla, Gasperini, G. M. Achille Guandalini, Bonavita, Kato, M. De Cecco, Multidimensional assessment of daily living activities in a shared augmented reality environment, in: 2022 IEEE International Workshop on Metrology for Living Environment, 2022, pp. 60–65.

[4] M. M. Rahman, M. F. Ishmam, M. T. Hossain, M. E. Haque, Virtual reality based medical training simulator and robotic operation system, in: 2022 International Conference on Recent Progresses in Science, Engineering and Technology (ICRPSET), IEEE, 2022, pp. 1–4.

[5] L. R. Medsker, Jain, et al., Recurrent neural networks, Design and Applications 5 (2001), pp. 2.

[6] D. P. Kingma, Welling, et al., An introduction to variational autoencoders, Foundations and Trends® in Machine Learning 12 (2019), pp. 307–392.

[7] Guo, Mu, M. G. Javed, Wang, L. Cheng, Momask: Generative masked modeling of 3d human motions, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1900–1910.

[8] Jiang, Chen, W. Liu, Yu, Chen, Motiongpt: Human motion as a foreign language, Advances in Neural Information Processing Systems 36 (2024).

[9] C. Ahuja, L.-P. Morency, Language2pose: Natural language grounded pose forecasting, in: 2019 International Conference on 3D Vision, IEEE, 2019, pp. 719–728.

[10] U. Bhattacharya, N. Rewkowski, A. Banerjee, P. Guhan, A. Bera, D. Manocha, Text2gestures: A transformer-based network for generating emotive body gestures for virtual agents, in: 2021 IEEE virtual reality and 3D user interfaces (VR), IEEE, 2021, pp. 1–10.

[11] C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, L. Cheng, Generating diverse and natural 3d human motions from text, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5152–5161.

[12] H. Cai, C. Bai, Y.-W. Tai, C.-K. Tang, Deep video generation, prediction and completion of human action sequences, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 366–382.

[13] Z. Wang, P. Yu, Y. Zhao, R. Zhang, Y. Zhou, J. Yuan, C. Chen, Learning diverse stochastic human-action generators by learning smooth latent transitions, in: Proceedings of the AAAI conference on artificial intelligence, volume 34, 2020, pp. 12281–12288.

[14] P. Yu, Y. Zhao, C. Li, J. Yuan, C. Chen, Structure-aware human-action generation, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, Springer, 2020, pp. 18–34.

[15] C. Guo, X. Zuo, S. Wang, X. Liu, S. Zou, M. Gong, L. Cheng, Action2video: Generating videos of human 3d actions, International Journal of Computer Vision 130 (2022), pp. 285–315.

[16] C. Guo, X. Zuo, S. Wang, S. Zou, Q. Sun, A. Deng, M. Gong, L. Cheng, Action2motion: Conditioned generation of 3d human motions, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2021–2029.

[17] M. Petrovich, M. J. Black, G. Varol, Action-conditioned 3d human motion synthesis with transformer vae, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10985–10995.

[18] K. Takeuchi, D. Hasegawa, S. Shirakawa, N. Kaneko, H. Sakuta, K. Sumi, Speech-to-gesture generation: A challenge in deep learning approach with bi-directional lstm, in: Proceedings of the 5th International Conference on Human Agent Interaction, 2017, pp. 365–369.

[19] E. Shlizerman, L. Dery, H. Schoen, I. Kemelmacher-Shlizerman, Audio to body dynamics, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7574–7583.

[20] T. Tang, J. Jia, H. Mao, Dance with melody: An lstm-autoencoder approach to music-oriented dance synthesis, in: Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 1598–1606.

[21] H.-Y. Lee, X. Yang, M.-Y. Liu, T.-C. Wang, Y.-D. Lu, M.-H. Yang, J. Kautz, Dancing to music, Advances in neural information processing systems 32 (2019).

[22] R. Huang, H. Hu, W. Wu, K. Sawada, M. Zhang, D. Jiang, Dance revolution: Long-term dance generation with music via curriculum learning, arXiv preprint arXiv:2006.06119 (2020).

[23] A. Ghosh, N. Cheema, C. Oguz, C. Theobalt, P. Slusallek, Synthesis of compositional animations from textual descriptions, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 1396–1406.

[24] A. S. Lin, L. Wu, R. Corona, K. Tai, Q. Huang, R. J. Mooney, Generating animated videos of human activities from natural language descriptions, Learning 1 (2018) 1.

[25] M. Plappert, C. Mandery, T. Asfour, Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks, Robotics and Autonomous Systems 109 (2018), pp. 13–26.

[26] C. Ahuja, L.-P. Morency, Language2pose: Natural language grounded pose forecasting, in: 2019 International Conference on 3D Vision, IEEE, 2019, pp. 719–728.

[27] C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, L. Cheng, Generating diverse and natural 3d human motions from text, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5152–5161.

[28] M. Petrovich, M. J. Black, G. Varol, Temos: Generating diverse human motions from textual descriptions, in: European Conference on Computer Vision, Springer, 2022, pp. 480–497.

[29] X. Chen, B. Jiang, W. Liu, Z. Huang, B. Fu, T. Chen, G. Yu, Executing your commands via motion diffusion in latent space, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18000–18010.

[30] J. Kim, J. Kim, S. Choi, Flame: Free-form language-based motion synthesis & editing, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 2023, pp. 8255–8263.

[31] H. Kong, K. Gong, D. Lian, M. B. Mi, X. Wang, Priority-centric human motion generation in discrete latent space, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 14806–14816.

[32] Y. Lou, L. Zhu, Y. Wang, X. Wang, Y. Yang, Diversemotion: Towards diverse human motion generation via discrete diffusion, arXiv preprint arXiv:2309.01372 (2023).

[33] J. Tseng, R. Castellon, K. Liu, Edge: Editable dance generation from music, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 448–458.

[34] M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, Z. Liu, Motiondiffuse: Text-driven human motion generation with diffusion model, arXiv preprint arXiv:2208.15001 (2022).

[35] K. Gong, D. Lian, H. Chang, C. Guo, Z. Jiang, X. Zuo, M. B. Mi, X. Wang, Tm2d: Bimodality driven 3d dance generation via music-text integration, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9942–9952.

[36] C. Guo, X. Zuo, S. Wang, L. Cheng, Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts, in: European Conference on Computer Vision, Springer, 2022, pp. 580–597.

[37] J. Zhang, Y. Zhang, X. Cun, Y. Zhang, H. Zhao, H. Lu, X. Shen, Y. Shan, Generating human motion from textual descriptions with discrete representations, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 14730–14740.

[38] Z. Zhou, B. Wang, Ude: A unified driving engine for human motion generation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5632–5641.

[39] J. Devlin, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

[40] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi, et al., Audiolm: a language modeling approach to audio generation, IEEE/ACM transactions on audio, speech, and language processing 31 (2023), pp. 2523–2533.

[41] J. Martinez, H. H. Hoos, J. J. Little, Stacked quantizers for compositional vector compression, arXiv preprint arXiv:1411.2173 (2014).

[42] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, M. Tagliasacchi, Soundstream: An end-to-end neural audio codec, IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021), pp. 495–507.

[43] W. Takano, Y. Nakamura, Statistical mutual conversion between whole body motion primitives and linguistic sentences for human motions, The International Journal of Robotics Research 34 (2015), pp. 1314–1328.

[44] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of machine learning research 21 (2020), pp. 1–67.

[45] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, Advances in neural information processing systems 35 (2022), pp. 27730–27744.

[46] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International conference on machine learning, PMLR, 2021, pp. 8748–8763.

[47] G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, D. Cohen-Or, Motionclip: Exposing human motion generation to clip space, in: European Conference on Computer Vision, Springer, 2022, pp. 358–374.