<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>H. Kato)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Experience Augmentation in Physical Therapy by Simulating Patient-Specific Walking Motions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Md Mustafizur Rahman</string-name>
          <email>rahman.md_mustafizur.rp6@naist.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Goshiro Yamamoto</string-name>
          <email>goshiro@kuhp.kyoto-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chang Liu</string-name>
          <email>liuchang@kuhp.kyoto-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hiroaki Ueshima</string-name>
          <email>h_ueshima@kuhp.kyoto-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isidro Butaslac</string-name>
          <email>isidro.b@naist.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Taishi Sawabe</string-name>
          <email>t.sawabe@is.naist.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hirokazu Kato</string-name>
          <email>kato@is.naist.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kyoto University</institution>
          ,
          <addr-line>54 Kawahara-cho, Shogoin, Sakyo-ku, Kyoto 606-8507</addr-line>
          ,
          <country country="JP">JAPAN</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Nara Institute of Science and Technology (NAIST)</institution>
          ,
          <addr-line>8916-5 Takayama-cho, Ikoma, Nara 630-0192</addr-line>
          ,
          <country country="JP">JAPAN</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>In physical therapy, understanding and analyzing patient movements, especially impaired gait patterns, is crucial for effective rehabilitation. Traditionally, trainee therapists acquire these skills through hands-on experience with real patients and textbooks. However, these methods are limited by the availability of patients and the variability of impaired motions that therapists can observe. To address these limitations, we propose a novel system that allows therapists to learn from a wide range of impaired gait motions without being restricted by time, place, or patient availability. This system utilizes the HumanML3D dataset and a two-step framework combining text2length sampling and text2motion generation. In the first step, a classification model predicts motion length based on the input textual descriptions. For the second step, we use a temporal variational autoencoder (VAE) for generating varied and consistent 3D motion sequences. A key component of our approach is the utilization of residual vector quantization (RVQ) from the MoMask framework, which minimizes errors and enhances the precision of motion generation. Furthermore, a Masked Transformer ensures that the synthesized motion tokens are temporally consistent and contextually accurate. Our system, validated through the HumanML3D dataset, provides an immersive and interactive tool for physical therapists, enabling dynamic, patient-specific motion simulations in mixed reality environments. By bridging the gap between conventional methods and MR-assisted training, this approach uses interactive 3D representations to transform how therapists learn. It aims to revolutionize therapeutic training, making rehabilitation strategies more effective and personalized.</p>
      </abstract>
      <kwd-group>
        <kwd>Physical Therapists</kwd>
        <kwd>Motion Generation</kwd>
        <kwd>Therapeutic Training</kwd>
        <kwd>Rehabilitation</kwd>
        <kwd>Mixed Reality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>[Figure 1: Conventional training compared with ours. Conventional: (A) the therapist observes the gait of a real impaired patient in a real environment. Ours: (B) the therapist with an HMD observes in an MR environment; (C) the therapist’s HMD view shows the generated gait of a virtual impaired patient. The therapist can repeatedly experience observing the virtual patient in front by looking through the HMD.]</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>One of the main goals of stroke rehabilitation programs is the recovery of gait, which is often an important goal for patients as well. Post-stroke functional recovery typically involves both natural processes and therapeutic interventions. While the majority of stroke survivors regain the ability to walk, many fail to achieve sufficient endurance, speed, or stability to perform daily activities independently and safely. After a stroke, falls are still a major problem for people who live in the community [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
      <p>In conventional therapy training, trainee therapists rely primarily on textbooks and hands-on experience with real patients to understand impaired gait motion. More recent studies used mixed reality (MR) to develop the therapist’s ‘clinical eye,’ enhancing their assessment through overlaid visualizations of patient data during rehabilitation [<xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>]. However, a limitation of this method is that the availability of patients restricts the variety and frequency of learning opportunities, limiting exposure to different types of impairments and hindering skill development. Building upon our prior work on a virtual reality (VR)-based medical training simulator [<xref ref-type="bibr" rid="ref4">4</xref>], which demonstrated the efficacy of immersive 3D modeling and robotic systems in enhancing medical training and remote surgery, this study extends these principles to augment therapeutic learning with patient-specific motion simulations in MR. In contrast to traditional methodologies, this research investigates how the incorporation of MR-based simulations of impaired gait motion can improve learning outcomes for therapists undergoing training by providing increased exposure to a variety of gait impairments.</p>
      <p>As illustrated in Fig. 1, when provided with the input description, “a man walks forward with a noticeable limp due to pain, favoring his right leg as he moves, his steps are uneven, and his body tilts slightly with each step, reflecting discomfort,” our system generates multiple unique three-dimensional (3D) impaired human motions that closely correspond to the given textual input.</p>
      <p>This approach aims to enhance traditional training methods by offering immersive, repeatable learning experiences, with the goal of improving diagnostic accuracy and therapeutic outcomes. The system aims to faithfully replicate a wide range of realistic 3D human motion dynamics that precisely adhere to the specified directions, actions, timing, speed, and style described in the text.</p>
      <p>Applications in robotics, human-machine interface, and virtual content creation, among others, could be greatly impacted by this automated process. Alternative approaches such as motion capture are costly and time-consuming, so automatic text-to-motion generation is more feasible and cost-effective. Despite this, the task is quite difficult because words and motion data are heterogeneous in many respects. To address this, a number of attempts have been made in recent years, such as the use of an encoder with recurrent neural networks (RNNs) [<xref ref-type="bibr" rid="ref5">5</xref>], variational autoencoders (VAE) [<xref ref-type="bibr" rid="ref6">6</xref>], and transformer networks aiming to embed the language and motion in the same space, converting them into a unified approach [<xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>]. Although these methods have proved effective with small units of text, the downside is that longer text conveying complex ideas does not produce good sequences of motion. Moreover, while existing diffusion processes have shown effectiveness for image generation and motion generation from text descriptions [9, 10], it remains unclear whether such improvements within one architecture come at a reasonable cost compared to more traditional Vector Quantized Variational Autoencoder (VQ-VAE) based approaches.</p>
      <p>In this work, we leverage the MoMask method introduced by Guo et al. [<xref ref-type="bibr" rid="ref7">7</xref>], which combines hierarchical quantization with generative transformer models to address the limitations of previous techniques. While traditional methods like Residual Vector Quantization (RVQ) [<xref ref-type="bibr" rid="ref6">6</xref>] attempt to reduce quantization errors by embedding motion tokens multiple times, MoMask offers a more advanced solution. As the first generative masked modeling framework for text-to-motion generation, MoMask features a hierarchical quantization generative model and a dedicated mechanism for precise residual quantization, base token generation, and residual token prediction. Additionally, we integrate the HumanML3D [11] dataset, which contains 14,616 annotated motion clips and 44,970 text descriptions, providing a comprehensive resource for generating and evaluating human motions. To facilitate seamless impaired humanoid motion retargeting, we develop a headless Blender Python API script that enables mapping between different humanoid rigs and allows for local saving of bone mappings. Moreover, we implement a FastAPI backend that allows users to stream data directly from Unity3D and use them for real-time humanoid animation visualization with an HMD, ensuring smooth integration and display.</p>
      <p>2. Related Work</p>
      <p>Existing work relating to our research mostly falls into the domains of (2.1) 3D human motion generation, (2.2) text-motion generation, and (2.3) language models and human motion captioning.</p>
      <p>2.1. 3D Human Motion Generation</p>
      <p>Significant advancements have been made in 3D human motion generation, utilizing various approaches that leverage action learning, audio, and text inputs. Traditional methods often employ a hidden state vector to generate sequential states. Basic approaches, such as those by Cai et al. [12] and Wang et al. [13], utilized GAN algorithms to extend partial sequences with newly generated states. In contrast, more advanced methods like Yu et al. [14] employed GCNs to capture the spatial and temporal dynamics of human motion. Furthermore, VAE and transformer-based models have been applied to better capture temporal dependencies, as demonstrated by Guo et al. [15, 16] and Petrovich et al. [17]. For audio-driven motion generation, techniques often transform acoustic features into human poses. Studies such as Takeuchi et al. [18] utilized bidirectional LSTMs to generate gestures from speech, while Shlizerman et al. [19] and Tang et al. [20] investigated song and dance motion generation. Recent models, such as Lee et al. [21], also focused on the stochastic aspects of movement, which introduced uncertainty in dance movements.</p>
      <p>2.2. Text-motion Generation</p>
      <p>Text-motion generation has become increasingly popular due to the ease of using natural language input. Previous studies [22, 23, 24, 25, 26] used mainly deterministic models, which typically average or blur the motion output. More recent stochastic models, such as those in T2M [27] and TEMOS [28], introduced more realism and variety into motion generation by using VAE structures and transformers to provide the shared transition between speech and motion [29, 30, 31, 32, 33, 34]. Recent innovations, such as autoregressive models [<xref ref-type="bibr" rid="ref8">35, 36, 8, 37, 38</xref>], have dramatically increased the quality of motion synthesis through denoising or motion suspension. Generative masked modeling inspired by BERT [39] has also been developed for human motion generation, using techniques such as residual quantization [40, 41, 42] to improve motion discretization and reduce quantization errors.</p>
      <p>2.3. Language Models and Human Motion Captioning</p>
      <p>The translation from natural language to human motion has evolved from mathematical models [43] to advanced neural networks like TM2T [36], which provides two-way visualization between text and movement. Major language models such as BERT [39], T5 [44], and InstructGPT [45] have pushed the boundaries of understanding across sectors. In multimodal learning, models like CLIP [46] have linked images with text, inspiring similar advancements in human motion tasks, such as MotionCLIP [47]. Despite this progress, language models still remain underutilized in human motion tasks. Our research seeks to integrate them into motion generation, leveraging pretrained models to create diverse motions.</p>
      <p>Moreover, while existing work predominantly focuses on generating normal human motions, our system specifically targets the generation of impaired motions crucial for physical therapy training and rehabilitation.</p>
      <p>3. Method</p>
      <p>3.1. System Overview</p>
      <p>The proposed system combines text-to-motion generation, motion retargeting, and 3D animation export to create realistic human motion sequences from textual descriptions. As illustrated in Fig. 2, it features a backend powered by a Python server using FastAPI and a frontend in Unity3D, allowing therapists to interact with animations via a Meta Quest 3 headset. The backend processes input prompts with MoMask model checkpoints to generate patient-specific impaired motions, saved in ‘.npy’ or ‘.bvh’ format.</p>
      <p>[Figure 2: System overview. API endpoints (/gen_text2motion, /download_fbx); text-to-motion generation ({.npy / .bvh format}) using pre-trained model checkpoints by MoMask; retargeting (transfer animation from source rig to destination rig); generation of the FBX animations.]</p>
      <p>In general, our system comprises the following core components:</p>
      <p>Backend: The backend features a FastAPI-based Python server responsible for processing the motion generation pipeline. It utilizes pre-trained models from the MoMask framework, which transform therapist input prompts into impaired motion representations. This includes generating motion sequences that reflect specific conditions or impairments, ensuring realistic and applicable outputs for therapeutic use.</p>
      <p>Frontend: The frontend is built on the Unity3D engine, which is employed to visualize the generated animations. It is designed to interface seamlessly with the Meta Quest 3 headset, enabling immersive interaction for therapists. This integration allows users to experience the animations in a three-dimensional space, providing a more intuitive understanding of the impaired motions.</p>
      <p>API Endpoints: The system utilizes two key API endpoints to manage communication between the frontend and backend. The first endpoint, “/gen_text2motion”, takes the therapist’s input prompt and triggers the motion generation process. The backend processes the prompt through the MoMask model, which translates the text description into a motion representation in formats like ‘.npy’ or ‘.bvh’. Once the motion is generated and retargeted, the second endpoint, “/download_fbx”, allows the frontend to retrieve the final FBX animation file. This file is then used to visualize the impaired motions in the MR interface. These API endpoints ensure smooth and efficient interaction between the components, allowing the system to generate and deliver animations in real time based on simple text input, thereby enhancing the rehabilitation experience for therapists.</p>
      <sec id="sec-2-1">
        <title>Therapist with HMD</title>
      </sec>
      <sec id="sec-2-2">
        <title>Our system generates the impaired motion region based on the input description.</title>
        <p>3.2. Text-to-Motion Generation</p>
        <p>Our system builds upon the state-of-the-art techniques for text-driven motion generation, particularly drawing inspiration from the MoMask framework. The text-to-motion process is detailed below:</p>
        <p>Tokenization of Motion Sequences: The textual descriptions are transformed into a sequence of discrete motion tokens using a vector quantization process. This process tokenizes complex human motion into a hierarchical structure of motion segments, each capturing different facets of the described action.</p>
        <p>Masked Motion Prediction: A Masked Transformer is employed to predict masked motion tokens conditioned on the input text. During the training phase, the model is trained to fill in randomly masked tokens from incomplete motion sequences. In the inference phase, it generates entire motion sequences by iteratively predicting missing tokens, ensuring global consistency and fidelity to the input description.</p>
        <p>Residual Refinement: After the base-layer motion is generated, a Residual Transformer is used to progressively refine the motion by predicting additional motion tokens that capture higher-order details. This step is crucial for enhancing the granularity and subtlety of the generated motion, ensuring fine control over aspects such as posture and movement transitions.</p>
        <p>Motion Generation Output: The final output is a continuous 3D human motion sequence generated in ‘.npy’ or ‘.bvh’ formats. These motion sequences represent high-quality, realistic animations that can be further processed or directly visualized.</p>
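        <p>As a rough illustration of the residual quantization idea used in Section 3.2, the sketch below quantizes one toy feature vector with three small hand-written codebooks. It is not the MoMask implementation: the function names (rvq_encode, rvq_decode), the CODEBOOKS values, and the 2-D feature are all assumptions made for this example, whereas the real framework learns its codebooks and applies them to latent motion sequences.</p>
        <preformat>
```python
from math import dist


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: dist(codebook[i], vec))


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the residual left by the previous one."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected entries of the first len(tokens) layers."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: one coarse base layer followed by two finer residual layers.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
    [[0.0, 0.0], [0.05, 0.05], [-0.05, -0.05]],
]

feature = [1.2, 0.7]  # stand-in for a single latent motion feature
tokens = rvq_encode(feature, CODEBOOKS)

# Reconstruction error when keeping only the first 1, 2, or 3 token layers.
errors = [dist(feature, rvq_decode(tokens[:k], CODEBOOKS)) for k in (1, 2, 3)]
print(tokens, errors)
```
        </preformat>
        <p>Each additional layer encodes what the earlier layers missed, so the reconstruction error shrinks as more token layers are kept. This mirrors the split between base-layer tokens and the residual tokens that refine finer motion details.</p>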
        <p>3.3. Motion Retargeting and Animation Transfer</p>
        <p>After generating the motion sequences, the system applies a motion retargeting process to map the generated motion onto the target 3D model’s skeleton. This is done using ‘keemap.rig.transfer’, a precision retargeting tool within Blender, which ensures accurate bone mapping and preserves key motion attributes such as foot positioning and overall balance. By retargeting the motion, we ensure that the animations are properly transferred to any humanoid rig, allowing seamless integration into various 3D models.</p>
        <p>3.4. Export and Visualization in Unity</p>
        <p>Once the motion has been retargeted, it is exported in FBX format, which contains both the 3D model and its associated animation. This format is then imported into Unity 3D, where additional adjustments to model positioning and animation playback can be made. A custom Unity frontend was developed to provide an intuitive interface where users, such as physical therapists, can interact with and visualize the generated motion sequences in real time, enhancing their ability to analyze and evaluate generated impaired motions.</p>
        <p>In Fig. 3, the interactive process of generating impaired motion regions based on the therapist’s input description is demonstrated. The sequence of images illustrates a physical therapist utilizing a head-mounted display to observe the virtual patient’s motion in real time. The leftmost frame captures the therapist’s perspective, showcasing the virtual patient within a simulated environment. The subsequent frames display the progression of the patient’s movement, with impaired regions clearly highlighted to indicate abnormalities in motion. This system enables therapists to interact with and analyze the impaired motion sequences in real time, providing an immersive, hands-on approach intended to enhance both diagnostic precision and therapeutic decision making.</p>
        <p>4. Pilot Experiment</p>
        <p>In the pilot experiment, we assessed the ability of our text-to-motion model to generate impaired gait motions. Fig. 4 highlights several outcomes from the experiment, showcasing both well-matched and mismatched animations.</p>
        <p>[Figure 4: Generated motions for three input descriptions. (A) “A man walks forward with a noticeable limp due to pain, favoring his right leg as he moves, his steps are uneven, and his body tilts slightly with each step, reflecting discomfort.” The motion matches the text description, and the avatar's body movement is also good. (B) “A person stands up from the ground, walks in a clockwise circle, and then sits back on the ground.” The motion matches the text description but both feet are bent. (C) “A man walks forward, they swing their left leg outward, causing their body to lean slightly to the right, after the left foot touches the ground, the right leg smoothly swings forward.” The motion doesn't match the text description because the avatar's body is straight.]</p>
        <p>In well-matched animations such as Fig. 4(A), the motion aligned well with the text description. The avatar showed a noticeable limp and discomfort, with uneven steps and a tilted posture, accurately portraying the described impairment. In some animations, such as Fig. 4(B), the generated motion mostly matched the description, but the avatar’s bent feet introduced a small inconsistency, detracting from realism. In mismatched animations, taking Fig. 4(C) as an example, although the left leg’s outward movement is captured, the avatar’s body remained too straight, failing to show the expected slight lean.</p>
        <p>The integration of the HumanML3D dataset, paired with pre-trained model checkpoints from MoMask, greatly improved the quality of the generated motions.</p>
        <p>These findings highlight the effectiveness of text-to-motion generation techniques. Additionally, the combination of Residual Vector Quantization-VAE (RVQ-VAE) and Transformer models contributed to the model’s ability to capture both coarse-grained and fine-grained motion details, further enhancing the fidelity and accuracy of the animations.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Future Work</title>
      <sec id="sec-3-1">
        <p>Looking ahead, we plan to broaden our research by developing a custom dataset composed of textual descriptions extracted from Electronic Health Records (EHRs). This domain-specific dataset will enable the model to generate motions that are more relevant to medical and therapeutic applications. Once this dataset is constructed, we will retrain our model to improve its performance in these specialized contexts.</p>
        <p>Currently, we have not tested the effectiveness of the virtual motions in MR environments, which is another key feature of our system. In the future, we also intend to conduct more extensive user studies with a large cohort of therapists to evaluate the long-term impact of using our MR system on their training outcomes. Furthermore, we will incorporate user feedback to refine the user interface and enhance the system’s overall functionality.</p>
        <p>Finally, we aim to develop additional features within the MR environment, such as the simulation of real-world therapeutic scenarios and the ability to track the therapists’ performance over time. These advancements will ensure that our system continues to be a valuable tool for therapist training and ultimately contributes to improved patient care.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <sec id="sec-4-1">
        <p>This study has introduced a system for enhancing physical therapy training through MR simulations of patient-specific walking motions. By utilizing the HumanML3D dataset and advanced techniques like RVQ and Masked Transformers, the system generates realistic impaired gait patterns from textual descriptions. The system aims to provide therapists with immersive and repeatable training experiences, leading to improved diagnostic accuracy and therapeutic outcomes.</p>
        <p>Future work will focus on developing a dataset from EHR and conducting user studies to assess the system’s effectiveness in therapist training. Overall, our system has the potential to significantly improve how therapists learn and analyze gait impairments.</p>
        <p>Acknowledgments</p>
        <p>This work was supported by JSPS KAKENHI Grant Number JP23K24888.</p>
        <p>References</p>
        <p>Advances in Neural Information Processing Systems 36 (2024).</p>
        <p>[9] C. Ahuja, L.-P. Morency, Language2pose: Natural language grounded pose forecasting, in: 2019 International Conference on 3D Vision, IEEE, 2019, pp. 719–728.</p>
        <p>[10] U. Bhattacharya, N. Rewkowski, A. Banerjee, P. Guhan, A. Bera, D. Manocha, Text2gestures: A transformer-based network for generating emotive body gestures for virtual agents, in: 2021 IEEE virtual reality and 3D user interfaces (VR), IEEE, 2021, pp. 1–10.</p>
        <p>[11] C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, L. Cheng, Generating diverse and natural 3d human motions from text, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5152–5161.</p>
        <p>[12] H. Cai, C. Bai, Y.-W. Tai, C.-K. Tang, Deep video generation, prediction and completion of human action sequences, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 366–382.</p>
        <p>[13] Z. Wang, P. Yu, Y. Zhao, R. Zhang, Y. Zhou, J. Yuan, C. Chen, Learning diverse stochastic human-action generators by learning smooth latent transitions, in: Proceedings of the AAAI conference on artificial intelligence, volume 34, 2020, pp. 12281–12288.</p>
        <p>[14] P. Yu, Y. Zhao, C. Li, J. Yuan, C. Chen, Structure-aware human-action generation, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, Springer, 2020, pp. 18–34.</p>
        <p>[15] C. Guo, X. Zuo, S. Wang, X. Liu, S. Zou, M. Gong, L. Cheng, Action2video: Generating videos of human 3d actions, International Journal of Computer Vision 130 (2022), pp. 285–315.</p>
        <p>[16] C. Guo, X. Zuo, S. Wang, S. Zou, Q. Sun, A. Deng, M. Gong, L. Cheng, Action2motion: Conditioned generation of 3d human motions, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2021–2029.</p>
        <p>[17] M. Petrovich, M. J. Black, G. Varol, Action-conditioned 3d human motion synthesis with transformer vae, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10985–10995.</p>
        <p>[18] K. Takeuchi, D. Hasegawa, S. Shirakawa, N. Kaneko, H. Sakuta, K. Sumi, Speech-to-gesture generation: A challenge in deep learning approach with bi-directional lstm, in: Proceedings of the 5th International Conference on Human Agent Interaction, 2017, pp. 365–369.</p>
        <p>[19] E. Shlizerman, L. Dery, H. Schoen, I. Kemelmacher-Shlizerman, Audio to body dynamics, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7574–7583.</p>
        <p>[20] T. Tang, J. Jia, H. Mao, Dance with melody: An lstm-autoencoder approach to music-oriented dance synthesis, in: Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 1598–1606.</p>
        <p>[21] H.-Y. Lee, X. Yang, M.-Y. Liu, T.-C. Wang, Y.-D. Lu, M.-H. Yang, J. Kautz, Dancing to music, Advances in neural information processing systems 32 (2019).</p>
        <p>[22] R. Huang, H. Hu, W. Wu, K. Sawada, M. Zhang, D. Jiang, Dance revolution: Long-term dance generation with music via curriculum learning, arXiv preprint arXiv:2006.06119 (2020).</p>
        <p>[23] A. Ghosh, N. Cheema, C. Oguz, C. Theobalt, P. Slusallek, Synthesis of compositional animations from textual descriptions, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 1396–1406.</p>
        <p>[24] A. S. Lin, L. Wu, R. Corona, K. Tai, Q. Huang, R. J. Mooney, Generating animated videos of human activities from natural language descriptions, Learning 1 (2018) 1.</p>
        <p>[25] M. Plappert, C. Mandery, T. Asfour, Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks, Robotics and Autonomous Systems 109 (2018), pp. 13–26.</p>
        <p>[26] C. Ahuja, L.-P. Morency, Language2pose: Natural language grounded pose forecasting, in: 2019 International Conference on 3D Vision, IEEE, 2019, pp. 719–728.</p>
        <p>[27] C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, L. Cheng, Generating diverse and natural 3d human motions from text, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5152–5161.</p>
        <p>[28] M. Petrovich, M. J. Black, G. Varol, Temos: Generating diverse human motions from textual descriptions, in: European Conference on Computer Vision, Springer, 2022, pp. 480–497.</p>
        <p>[29] X. Chen, B. Jiang, W. Liu, Z. Huang, B. Fu, T. Chen, G. Yu, Executing your commands via motion diffusion in latent space, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18000–18010.</p>
        <p>[30] J. Kim, J. Kim, S. Choi, Flame: Free-form language-based motion synthesis &amp; editing, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 2023, pp. 8255–8263.</p>
        <p>[31] H. Kong, K. Gong, D. Lian, M. B. Mi, X. Wang, Priority-centric human motion generation in discrete latent space, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 14806–14816.</p>
        <p>[32] Y. Lou, L. Zhu, Y. Wang, X. Wang, Y. Yang, Diversemotion: Towards diverse human motion generation via discrete diffusion, arXiv preprint arXiv:2309.01372 (2023).</p>
        <p>[33] J. Tseng, R. Castellon, K. Liu, Edge: Editable dance generation from music, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 448–458.</p>
        <p>[34] M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, Z. Liu, Motiondiffuse: Text-driven human motion generation with diffusion model, arXiv preprint arXiv:2208.15001 (2022).</p>
        <p>[35] K. Gong, D. Lian, H. Chang, C. Guo, Z. Jiang, X. Zuo, M. B. Mi, X. Wang, Tm2d: Bimodality driven 3d dance generation via music-text integration, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9942–9952.</p>
        <p>[36] C. Guo, X. Zuo, S. Wang, L. Cheng, Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts, in: European Conference on Computer Vision, Springer, 2022, pp. 580–597.</p>
        <p>[37] J. Zhang, Y. Zhang, X. Cun, Y. Zhang, H. Zhao, H. Lu, X. Shen, Y. Shan, Generating human motion from textual descriptions with discrete representations, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 14730–14740.</p>
        <p>[38] Z. Zhou, B. Wang, Ude: A unified driving engine for human motion generation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5632–5641.</p>
        <p>[39] J. Devlin, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
        <p>[40] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi, et al., Audiolm: a language modeling approach to audio generation, IEEE/ACM transactions on audio, speech, and language processing 31 (2023), pp. 2523–2533.</p>
        <p>[41] J. Martinez, H. H. Hoos, J. J. Little, Stacked quantizers for compositional vector compression, arXiv preprint arXiv:1411.2173 (2014).</p>
        <p>[42] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, M. Tagliasacchi, Soundstream: An end-to-end neural audio codec, IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021), pp. 495–507.</p>
        <p>[43] W. Takano, Y. Nakamura, Statistical mutual conversion between whole body motion primitives and linguistic sentences for human motions, The International Journal of Robotics Research 34 (2015), pp. 1314–1328.</p>
        <p>[44] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of machine learning research 21 (2020), pp. 1–67.</p>
        <p>[45] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, Advances in neural information processing systems 35 (2022), pp. 27730–27744.</p>
        <p>[46] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International conference on machine learning, PMLR, 2021, pp. 8748–8763.</p>
        <p>[47] G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, D. Cohen-Or, Motionclip: Exposing human motion generation to clip space, in: European Conference on Computer Vision, Springer, 2022, pp. 358–374.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Belda-Lois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mena-del Horno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Bermejo-Bosch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Pons</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Farina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Molinari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tamburella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramos</surname>
          </string-name>
          , et al.,
          <article-title>Rehabilitation of gait after stroke: a review towards a top-down approach</article-title>
          ,
          <source>Journal of neuroengineering and rehabilitation</source>
          <volume>8</volume>
          (
          <year>2011</year>
          ) , pp.
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>De Cecco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Luchetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Butaslac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M. A.</given-names>
            <surname>Guandalini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bonavita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazzucato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hirokazu</surname>
          </string-name>
          ,
          <article-title>Sharing augmented reality between a patient and a clinician for assessment and rehabilitation in daily living activities</article-title>
          ,
          <source>Information</source>
          <volume>14</volume>
          (
          <year>2023</year>
          ) , Art. No.
          <elocation-id>204</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Luchetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Butaslac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fruet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Nollo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Ianes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gasperini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Achille Guandalini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bonavita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>De Cecco</surname>
          </string-name>
          ,
          <article-title>Multidimensional assessment of daily living activities in a shared augmented reality environment</article-title>
          ,
          in:
          <source>2022 IEEE International Workshop on Metrology for Living Environment</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>60</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Ishmam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Haque</surname>
          </string-name>
          ,
          <article-title>Virtual reality based medical training simulator and robotic operation system</article-title>
          ,
          in:
          <source>2022 International Conference on Recent Progresses in Science, Engineering and Technology (ICRPSET)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Medsker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jain</surname>
          </string-name>
          , et al.,
          <source>Recurrent neural networks, Design and Applications</source>
          <volume>5</volume>
          (
          <year>2001</year>
          ) , pp.
          <fpage>2</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          , et al.,
          <article-title>An introduction to variational autoencoders</article-title>
          ,
          <source>Foundations and Trends® in Machine Learning</source>
          <volume>12</volume>
          (
          <year>2019</year>
          ) , pp.
          <fpage>307</fpage>
          -
          <lpage>392</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Javed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <article-title>Momask: Generative masked modeling of 3d human motions</article-title>
          ,
          in:
          <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1900</fpage>
          -
          <lpage>1910</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Motiongpt: Human motion as a foreign language,</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>