<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Italian Sign Language Generation for digital humans</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emanuele Colonna</string-name>
          <email>emanuele.colonna@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Arezzo</string-name>
          <email>arezzo@quest-it.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Roberto</string-name>
          <email>d.roberto8@studenti.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Landi</string-name>
          <email>d.landi@quest-it.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Vitulano</string-name>
          <email>felice.vitulano@quest-it.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gennaro Vessio</string-name>
          <email>gennaro.vessio@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanna Castellano</string-name>
          <email>giovanna.castellano@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>QuestIT S.r.l.</institution>
          ,
          <addr-line>Siena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the rapidly evolving field of human-computer interaction, the need for inclusive and accessible communication methods has become increasingly vital. This paper introduces an early exploration of Text-to-LIS, a new model designed to generate contextually accurate Italian Sign Language (LIS) gestures for digital humans. Our approach addresses the importance of non-verbal communication in virtual environments, focusing on enhancing interaction for the deaf and hard-of-hearing community. The core contribution of this work is developing an iterative framework that leverages a comprehensive multimodal dataset, integrating textual and audio inputs with visual data. Utilizing state-of-the-art deep learning algorithms and advanced human pose estimation techniques, the framework enables the progressive refinement of generated gestures, ensuring realism and contextual relevance. The potential applications of the Text-to-LIS model are wide-ranging, from improving accessibility in digital environments to supporting educational tools and promoting LIS in the digital age. The code is publicly available at: https://github.com/CarpiDiem98/text-to-lis/.</p>
      </abstract>
      <kwd-group>
        <kwd>Sign language generation</kwd>
        <kwd>Human pose estimation</kwd>
        <kwd>Digital humans</kwd>
        <kwd>Inclusive technology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The advancement of graphics and robotics technology has significantly contributed to the rise of virtual
and socially intelligent agents, making them increasingly popular for human interaction. This progress
has enabled the development of artificial agents with either virtual or physical embodiments, such as
avatars or robots, capable of interacting with humans across diverse settings. Among these, digital
humans are particularly impactful, replicating human form and behavior within virtual environments [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        A key component of effective interaction with digital humans is nonverbal communication, which
includes facial expressions, gestures, and body language [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Gestures, especially co-speech gestures
that accompany verbal communication, enhance these agents’ realism and engagement. However,
automatically generating natural and synchronized gestures remains a significant challenge due to the
complexity and diversity of human nonverbal communication [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>In this context, sign languages such as Italian Sign Language (LIS) introduce an even more complex
dimension of nonverbal communication. Sign languages are not simply gestures but fully developed
languages that serve as the primary means of communication for the deaf and hard-of-hearing
community. This paper addresses the challenge of generating realistic LIS gestures for digital human agents,
recognizing sign languages’ critical role in communication and the unique needs of the deaf community.</p>
      <p>Specifically, we propose a novel approach that employs an iterative refinement process, training a
model on a comprehensive dataset of text and image pairs representing LIS signs (Fig. 1). Our approach
integrates textual descriptions and visual data to generate accurate and expressive LIS gestures. The
model has five main parts: a text encoder, which uses Transformers; a pose encoder, which handles
poses; a pose-text encoder, which combines the two; a step encoder, which makes refinements at each
step; and a projection module, which produces the final poses. This model captures linguistic and
visual aspects of LIS signs. The iterative process begins with an initial generic pose and progresses
through multiple steps. Advanced human pose estimation techniques serve as the ground truth for our
dataset, allowing for the precise capture and translation of human body movements into 3D animations
for virtual models. We present a robust solution for creating natural and coherent LIS gestures by
combining textual and visual data.</p>
      <p>By advancing technology for LIS gesture generation, we aim to achieve several important goals.
First, we seek to improve accessibility by generating accurate and natural LIS gestures, which can
enhance communication tools for the deaf community, making digital content and interactions more
accessible. Additionally, our work promotes LIS, a minority language facing challenges in preservation
and promotion, by contributing to its digital representation and documentation, thus supporting its
importance in the digital era. Another significant aim is to enhance education; accurate LIS gesture
generation can serve as a valuable resource for educational tools, helping deaf individuals learn written
Italian and aiding hearing individuals in acquiring LIS. Moreover, as virtual and augmented reality
technologies become prominent, it is crucial to ensure that LIS users can fully participate in these digital
environments, fostering inclusivity. Lastly, our model and dataset offer valuable resources for linguistic
research, particularly for scholars studying the structure and patterns of LIS, thereby contributing to
the broader understanding of sign languages.</p>
      <p>The rest of this paper is structured as follows. Section 2 reviews the existing literature. Section 3
introduces the proposed dataset. Section 4 details the proposed Text-to-LIS model. Section 5 presents
preliminary results and discusses future work directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>Our Text-to-LIS model builds on several areas of research, including pose extraction, sign language
datasets, gesture generation, and Italian Sign Language research. This section provides an overview of
the relevant work in these fields.</p>
      <sec id="sec-2-1">
        <title>2.1. Pose extraction</title>
        <p>Pose extraction is essential for creating realistic digital humans, as it captures and translates human
body movements into 3D animations. Using computer vision techniques, pose estimation methods infer
human poses from images without requiring markers. These methods are typically categorized into
whole-body and single-part estimations, each with specific challenges.</p>
        <p>
          Several models have emerged to reconstruct human body posture from a single image. PIXIE [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]
generates complete 3D models even with challenging poses or incomplete body information. Hand4Whole [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
simultaneously estimates both the full-body and hand poses, outperforming prior methods such as
FrankMocap [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and PIXIE. PyMAF-X [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] improves accuracy and speed, estimating SMPL-X parameters
with detailed joint rotation and depth information. SMPL-X (Skinned Multi-Person Linear model
with eXpressive hands and face) is a comprehensive 3D human body model that integrates detailed
representations of the body, face, and hands [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. SMPL-X parameters are the values used to configure
this model, including joint angles, body shape coefficients, and facial expression parameters.
        </p>
        <p>
          Hand pose estimation has also advanced significantly. Early approaches like those by Baek et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]
and Boukhayma et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] used parametric hand models such as MANO [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] to match hand shapes to
images. Later methods, such as the already mentioned PyMAF-X [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], moved away from predefined
models, directly predicting the 3D shape of the hand point by point, allowing for greater detail and
flexibility.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Sign language datasets</title>
        <p>
          While several datasets exist for various sign languages, there remains a need for more extensive and
diverse resources, especially for the Italian language. Notable sign language datasets include:
• RWTH-Phoenix-2014T: A German Sign Language (DGS) dataset with approximately 11 hours of
content [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
• Boston104: An American Sign Language (ASL) dataset with about 9 hours of video [14].
• How2Sign: A large-scale multimodal ASL dataset with 79 hours of content [15].
• TGLIS-227: A LIS dataset with approximately 19 hours of video [16].
        </p>
        <p>Other LIS datasets, such as those in [17, 18], are private or partially accessible. Our work aims to
complement and extend these existing resources by providing a novel and comprehensive multimodal
dataset for LIS, including video, audio, text, and extracted key points.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Gesture generation</title>
        <p>
          Recent advancements in gesture generation have focused on creating more natural and context-aware
movements. Yoon et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] proposed generating speech gestures using trimodal context, incorporating
text, audio, and speaker identity. Their approach highlights the importance of considering multiple
modalities for realistic gesture synthesis. Similarly, Yang et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] introduced DiffuseStyleGesture,
a diffusion-based model for generating stylized co-speech gestures, demonstrating the potential of
advanced generative models for creating diverse and expressive movements.
        </p>
        <p>In the context of sign language generation, Shi et al. [19] developed an open-domain sign language
translation system learned from online videos, showcasing the feasibility of generating sign language
from large-scale web data. Our Text-to-LIS model builds on these advancements by incorporating
iterative refinement and utilizing textual and visual information to generate accurate and expressive
LIS gestures.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Italian Sign Language research</title>
        <p>Research on LIS is growing, but there is still a need for more comprehensive studies and resources.
Marchisio et al. [17] introduced deep learning techniques with data augmentation for LIS recognition.
Fagiani et al. [18] contributed by creating a new LIS database, adding to the resources available for LIS
research. Bertoldi et al. [16] developed a large-scale Italian-LIS parallel corpus, which has been valuable
for machine translation and linguistic studies. However, their work primarily focused on text-based
representations rather than visual gesture generation.</p>
        <p>Our research extends these eforts by creating a more comprehensive LIS dataset and developing a
model specifically designed for generating realistic LIS gestures from textual input. This work bridges the
gap between textual representations and visual sign language production, contributing to computational
linguistics and assistive technology.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed dataset</title>
      <p>Our proposed dataset is a comprehensive, multimodal collection designed to advance research and
application development in Italian Sign Language. It addresses the scarcity of publicly available LIS data
and supports various applications, including human movement analysis, nonverbal communication
recognition, and understanding human behavior in digital environments.</p>
      <p>The dataset includes approximately 37 hours of LIS content:
• Video: High-quality video recordings of signers performing LIS during TV news broadcasts,
segmented to align with spoken phrases.
• Audio: Corresponding audio recordings, including the signer’s voice and ambient news sounds.
• Text: Transcriptions of the spoken content, initially generated using Whisper [20] and manually
corrected for accuracy.
• Key points: Body and hand joint positions, stored in pickle file format for each frame of the
videos.</p>
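      <p>As an illustrative sketch of how one segmented clip could be read back, the following Python snippet gathers the four modalities; the file names and directory layout are assumptions for illustration, not the released format.</p>
      <preformat>
```python
# Sketch of loading one dataset sample; file layout and field names are
# illustrative assumptions, not the released format.
import pickle
from pathlib import Path

def load_sample(sample_dir: str) -> dict:
    """Gather the four modalities stored for one segmented clip."""
    root = Path(sample_dir)
    text = (root / "transcript.txt").read_text(encoding="utf-8")
    with open(root / "keypoints.pkl", "rb") as f:
        keypoints = pickle.load(f)  # per-frame body and hand joints
    return {
        "video": root / "clip.mp4",  # path only; decode lazily
        "audio": root / "clip.wav",
        "text": text.strip(),
        "keypoints": keypoints,
    }
```
      </preformat>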
      <p>The videos were segmented according to the transcriptions produced by Whisper; to keep the pipeline
fully automated, no preprocessing was applied to the transcription output. Gloss extraction techniques,
common in many sign language datasets, were deliberately not applied, as qualitative observations suggest
they can reduce the deaf community’s understanding of the movement. In this dataset, the textual unit is
therefore the whole sentence.</p>
      <p>We utilized a fully automated web scraping mechanism to gather LIS news broadcast videos from
multiple platforms, primarily YouTube, while ensuring compliance with privacy regulations. This
approach allowed us to collect diverse signers and contexts, enhancing the dataset’s diversity and
representativeness. For key point extraction, we employed two state-of-the-art techniques:
• Hybrik-X [21]: Known for its accuracy and robustness, Hybrik-X is optimized for real-time
execution on mobile devices and performs well in high-detail scenarios.
• HaMeR [22]: HaMeR reconstructs a 3D hand mesh from a single RGB image, utilizing a Vision
Transformer (ViT) [23] for detailed hand pose estimation.</p>
      <p>
        The extracted key points were normalized using the SMPL-X model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], ensuring consistency between
the body and hand models. Figure 2 shows an overview of the multiple modalities collected in our
dataset.
      </p>
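      <p>The per-frame normalization can be sketched as follows; the choice of the pelvis as root joint and the simple concatenation of body and hand joints are illustrative assumptions rather than the exact SMPL-X parameterization.</p>
      <preformat>
```python
# Illustrative sketch: merge body and hand joints into one root-relative
# pose vector per frame. Joint counts and root choice are assumptions.
import numpy as np

def normalize_frame(body_joints, left_hand, right_hand):
    """Combine per-frame body and hand joints into one root-relative vector."""
    # body_joints: (B, 3); left_hand / right_hand: (H, 3) in camera coordinates
    root = body_joints[0]  # pelvis taken as the reference joint
    joints = np.concatenate([body_joints, left_hand, right_hand], axis=0)
    return (joints - root).reshape(-1)  # root-relative, flattened
```
      </preformat>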
      <p>Compared to current state-of-the-art sign language datasets, our LIS dataset stands out in its
multimodal nature and substantial duration (see Table 1). While other LIS datasets exist, they are often
either private or limited in accessibility [17, 18]. We aim to continually expand this dataset to enhance
its utility for LIS research.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed method</title>
      <p>This section presents our Text-to-LIS model for automatic gesture generation based on textual
descriptions. Our approach builds on the work of Zhang et al. [24], employing an iterative refinement process
to generate a sequence of poses from textual input generated by automatic transcription [20]. The key
innovation of our method lies in its ability to progressively enhance pose quality through multiple
refinement steps, leveraging both textual and positional information.</p>
      <sec id="sec-4-1">
        <title>4.1. Model architecture</title>
        <p>The core components of our Text-to-LIS model, shown in Fig. 1, include:
• Text encoder: A Transformer-based encoder that processes text embeddings to generate a dense
representation of the input text. It uses multi-head attention mechanisms and feed-forward neural
networks to capture the contextual relationships between tokens. The text encoder receives the
corresponding phrase of the LIS gesture as its input.
• Pose encoder: A Transformer-based encoder designed to handle the sequence of poses. This
encoder applies attention mechanisms to the current state of the gesture (which, in the initial
iteration, is a generic starting pose) to represent the directional matrices.
• Pose-text encoder: This component combines and processes the joint information from text and
pose data.
• Step encoder: A small neural network representing the current iterative process step. It refines
this representation with embedding layers, integrating information from previous steps to inform
subsequent pose adjustments.
• Projection module: This module transforms hidden representations into final poses, mapping the
refined poses back into the appropriate output space.</p>
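        <p>A minimal sketch of the five components, assuming PyTorch; the hidden width, head count, pose dimension, and layer depths are illustrative placeholders, not the values used in our experiments.</p>
        <preformat>
```python
# Minimal PyTorch sketch of the five components; D, POSE_DIM, and STEPS
# are illustrative placeholders, not the experimental settings.
import torch
import torch.nn as nn

D, POSE_DIM, STEPS = 256, 165, 10

class TextToLIS(nn.Module):
    def __init__(self):
        super().__init__()
        def layer():
            return nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.pose_in = nn.Linear(POSE_DIM, D)        # pose encoder input map
        self.pose_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.pose_text_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.step_encoder = nn.Embedding(STEPS, D)   # current refinement step
        self.projection = nn.Linear(D, POSE_DIM)     # back to pose space

    def forward(self, text_emb, poses, step):
        # text_emb: (B, T_text, D); poses: (B, T_pose, POSE_DIM); step: (B,)
        t = self.text_encoder(text_emb)
        p = self.pose_encoder(self.pose_in(poses))
        joint = self.pose_text_encoder(torch.cat([t, p], dim=1))
        joint = joint + self.step_encoder(step).unsqueeze(1)
        # keep only the pose positions and project to refined poses
        return self.projection(joint[:, t.shape[1]:])
```
        </preformat>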
        <p>The process is iterated, with each iteration taking the output from the previous step as its new input.
This iterative approach aims to effectively translate sentences into fluid and accurate movements.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Iterative refinement process</title>
        <p>The iterative refinement process is the heart of our Text-to-LIS model. It works similarly to an artist
creating a painting, starting with a rough sketch and gradually refining it through multiple steps until a
detailed work of art emerges. The process begins with a textual input (a description of a gesture in LIS)
and an initial generic pose, which serves as a foundation. From this starting point, the model iterates
through a series of refinements, progressively improving the pose.</p>
        <p>At each iteration, the text and current pose are processed by their respective encoders, allowing the
model to “understand” both the description and the current pose. The step encoder keeps track of the
progress made so far, integrating information from previous refinements. Based on this understanding,
the model outputs an improved pose version. This process repeats over several iterations, with each
cycle producing a more accurate and detailed representation of the LIS gesture described in the text.
This gradual refinement allows the model to capture subtle nuances and correct errors step by step,
leading to more natural and expressive gesture generation.</p>
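        <p>The refinement loop itself can be outlined as follows; model stands for any predictor taking a text embedding, the current pose sequence, and the step index, and the step count is an illustrative default.</p>
        <preformat>
```python
# The refinement loop in outline: start from a generic pose and repeatedly
# feed the model its own output as the new input.
import torch

def generate(model, text_emb, n_frames, pose_dim, n_steps=10):
    poses = torch.zeros(1, n_frames, pose_dim)  # generic starting pose
    for s in range(n_steps):
        step = torch.tensor([s])
        poses = model(text_emb, poses, step)    # refined estimate
    return poses
```
        </preformat>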
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Training procedure</title>
        <p>Training our Text-to-LIS model involves strategies to facilitate effective learning, creativity, and precision.
Two key techniques used during training are:
• Teacher forcing [25]: Similar to guiding an apprentice artist, this technique alternates between
allowing the model to make its predictions and providing the correct pose for the next step.
This approach enables the model to learn from independent attempts and supervised guidance,
improving its ability to generate accurate poses.
• Controlled noise injection: To improve robustness and flexibility, we introduce random variations
(or “noise”) into the poses during training. This involves adding small Gaussian noise to joint
positions. This is akin to practicing under different conditions, such as using different brushes or
lighting in art, which helps prevent overfitting and encourages the model to learn the underlying
structure of LIS gestures.</p>
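        <p>A single training step combining the two techniques might look as follows; the teacher-forcing ratio and noise scale are illustrative values, not tuned hyperparameters.</p>
        <preformat>
```python
# Teacher forcing plus controlled noise injection: with probability tf_ratio
# the ground-truth pose replaces the model's own prediction as the next
# input, and small Gaussian noise perturbs the joint positions.
import random
import torch

def training_inputs(pred_poses, gt_poses, tf_ratio=0.5, noise_std=0.01):
    # Teacher forcing: sometimes feed the ground truth instead of the
    # model's own prediction as the next-step input.
    source = gt_poses if tf_ratio > random.random() else pred_poses
    # Controlled noise injection on joint positions.
    return source + noise_std * torch.randn_like(source)
```
        </preformat>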
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Preliminary results and future work</title>
      <p>We conducted exploratory experiments training the model for 200 epochs with a batch size of 16. Our
analyses were performed on a subset of the dataset, consisting of approximately six thousand videos,
each with an average duration of ten seconds. Table 2 summarizes the hyperparameters used during
model training and evaluation.</p>
      <p>Two loss functions were employed to evaluate the model’s performance: the MPJPE (Mean per Joint
Position Error) and a refined loss specifically designed to account for the model’s confidence in each
predicted pose point. To define this loss function, first, the squared error between the ground truth
pose p_{ij} and the predicted pose p̂_{ij} is calculated:</p>
      <p>e_{ij} = ‖p̂_{ij} − p_{ij}‖², (1)</p>
      <p>This error is then weighted by a confidence vector c that represents the model’s certainty about each
predicted joint position, leading to the loss function:</p>
      <p>e^w_{ij} = c_{ij} · ‖p̂_{ij} − p_{ij}‖². (2)</p>
      <p>This loss function enables the model to prioritize joints with higher confidence while assigning less
weight to uncertain predictions. Finally, the mean weighted error is calculated and normalized, yielding
the final refined loss:</p>
      <p>L_refined = (1 / (N · J)) ∑_{i=1}^{N} ∑_{j=1}^{J} e^w_{ij} · log(S + 1), (3)</p>
      <p>where N · J represents the number of samples in the batch multiplied by the number of joints in each
pose. A key feature of this loss is the normalization based on the number of model steps S, computed
with the logarithmic function log(S + 1). The MPJPE is a widely used metric for assessing the accuracy
of 3D pose estimation. It quantifies the average discrepancy between predicted and actual joint positions
across all samples:</p>
      <p>MPJPE = (1 / (N · J)) ∑_{i=1}^{N} ∑_{j=1}^{J} ‖p̂_{ij} − p_{ij}‖₂, (4)</p>
      <p>where p̂_{ij} represents the predicted 3D coordinate for joint j of sample i, p_{ij} is the corresponding ground
truth, N is the number of samples, and J is the number of joints per sample.</p>
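      <p>The two losses can be sketched in NumPy as follows, under the assumption that poses are arrays of 3D joint coordinates and confidences are given per predicted joint.</p>
      <preformat>
```python
# NumPy sketch of the two losses: per-joint squared errors weighted by a
# confidence value and scaled by log(S + 1) for the refined loss, Eqs. (1)-(3);
# plain Euclidean distances averaged for MPJPE, Eq. (4).
import numpy as np

def refined_loss(pred, gt, conf, n_steps):
    # pred, gt: (N, J, 3) predicted and ground-truth joints; conf: (N, J)
    err = np.sum((pred - gt) ** 2, axis=-1)  # squared error per joint, Eq. (1)
    weighted = conf * err                    # confidence weighting, Eq. (2)
    n, j = err.shape
    # mean weighted error, scaled by log(S + 1) over the model steps, Eq. (3)
    return weighted.sum() * np.log(n_steps + 1) / (n * j)

def mpjpe(pred, gt):
    # mean Euclidean distance over all samples and joints, Eq. (4)
    return np.linalg.norm(pred - gt, axis=-1).mean()
```
      </preformat>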
      <p>We conducted experiments comparing diferent configurations to determine the optimal number of
refinement steps. As shown in Table 3, increasing the number of refinement steps significantly improves
the quality of the generated poses. The improvement was most pronounced up to ten refinement passes,
after which further increases produced diminishing returns and significantly increased the generation
time. Ten refinement passes thus offered the best balance between the accuracy of the generated poses
and the model’s computational demands. Each refinement step took a few seconds when training the
model on a GeForce RTX 4090 graphics card.</p>
      <p>The preliminary results demonstrate the effectiveness of the Text-to-LIS model in generating realistic
LIS poses from textual descriptions. The model’s iterative refinement approach produces high-quality
poses, as evidenced by qualitative evaluation. These results (Fig. 3) indicate the model’s potential as a
valuable tool for enhancing digital human interactions, virtual reality environments, and nonverbal
communication systems.</p>
      <p>While the results are promising, several avenues for further research and development remain.
Expanding the dataset with more diverse signers, gestures, and contexts is essential to improve the
model’s generalization capabilities. On the technical side, investigating advanced attention mechanisms
and temporal modules may help the model better capture long-term dependencies and subtle nuances
in gestures. Real-time sign language generation is another critical goal for practical applications,
and techniques like model pruning and quantization could reduce computational complexity without
sacrificing accuracy. Given that sign language communication is inherently multimodal, future work
should also focus on integrating hand gestures, facial expressions, and body language into a unified
model to generate more natural and expressive LIS gestures. Moreover, although the current focus is on
LIS, the techniques developed in this research could be adapted to other sign languages or nonverbal
communication systems, broadening the scope and impact of this work.</p>
      <p>Finally, collaboration with the deaf community, linguists, and technologists will be essential to ensure
that our advancements are both technically sound and socially impactful.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was supported by a PhD fellowship awarded to Emanuele Colonna, funded under the
Italian National Recovery and Resilience Plan (D.M. n. 117/23), Mission 4, Component 2, Investment 3.3.
The PhD project, titled “Study of AI Techniques for Efficient Generation of Digital Humans and 3D
Environments” (CUP H91I23000690007), is co-funded by QuestIT S.r.l. Additionally, this research was
partially supported by the UNIBA-MAML (Microsoft Azure Machine Learning) agreement.</p>
      <p>[14] P. Dreuw, D. Rybach, T. Deselaers, M. Zahedi, H. Ney, Speech recognition techniques for a sign
language recognition system, Hand 60 (2007) 80.
[15] A. Duarte, S. Palaskar, L. Ventura, D. Ghadiyaram, K. DeHaan, F. Metze, J. Torres, X. Giro-i-Nieto,
How2Sign: A large-scale multimodal dataset for continuous American Sign Language, in: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2735–2744.
[16] N. Bertoldi, G. Tiotto, P. Prinetto, E. Piccolo, F. Nunnari, V. Lombardo, A. Mazzei, R. Damiano,
L. Lesmo, A. Principe, On the creation and the annotation of a large-scale Italian-LIS parallel corpus,
in: International Conference on Language Resources and Evaluation, 2010, pp. 19–22.
[17] M. Marchisio, A. Mazzei, D. Sammaruga, Introducing Deep Learning with Data Augmentation
and Corpus Construction for LIS, in: Italian Conference on Computational Linguistics, 2023. URL:
https://api.semanticscholar.org/CorpusID:266726316.
[18] M. Fagiani, S. Squartini, E. Principi, F. Piazza, A New Italian Sign Language Database, 2012.
doi:10.1007/978-3-642-31561-9_18.
[19] B. Shi, D. Brentari, G. Shakhnarovich, K. Livescu, Open-Domain Sign Language Translation
Learned from Online Video, in: EMNLP, 2022.
[20] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust Speech Recognition
via Large-Scale Weak Supervision, 2022.
[21] J. Li, C. Xu, Z. Chen, S. Bian, L. Yang, C. Lu, HybrIK: A hybrid analytical-neural inverse kinematics
solution for 3D human pose and shape estimation, in: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, 2021, pp. 3383–3393.
[22] G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. Fouhey, J. Malik, Reconstructing hands
in 3D with transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2024, pp. 9826–9836.
[23] A. Kolesnikov, A. Dosovitskiy, D. Weissenborn, G. Heigold, J. Uszkoreit, L. Beyer, M. Minderer,
M. Dehghani, N. Houlsby, S. Gelly, T. Unterthiner, X. Zhai, An image is worth 16x16 words:
Transformers for image recognition at scale, 2021.
[24] M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, Z. Liu, MotionDiffuse: Text-driven human
motion generation with diffusion model, arXiv preprint arXiv:2208.15001 (2022).
[25] S. Bengio, O. Vinyals, N. Jaitly, N. Shazeer, Scheduled sampling for sequence prediction with
recurrent neural networks, in: Proceedings of the 28th International Conference on Neural
Information Processing Systems - Volume 1, NIPS’15, MIT Press, Cambridge, MA, USA, 2015, pp.
1171–1179.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bonetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Hromei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Stranisci</surname>
          </string-name>
          , Preface to the
          <source>Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI)</source>
          ,
          <source>in: Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI</source>
          <year>2024</year>
          )
          <article-title>co-located with 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA</article-title>
          <year>2024</year>
          ),
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.-Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A systematic review on digital human models in assembly process planning</article-title>
          ,
          <source>The International Journal of Advanced Manufacturing Technology</source>
          <volume>125</volume>
          (
          <year>2023</year>
          )
          <fpage>1037</fpage>
          -
          <lpage>1059</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Speech gesture generation from the trimodal context of text, audio, and speaker identity</article-title>
          ,
          <source>ACM Transactions on Graphics (TOG)</source>
          <volume>39</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <article-title>DiffuseStyleGesture: Stylized audio-driven co-speech gesture generation with diffusion models</article-title>
          ,
          <source>in: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, International Joint Conferences on Artificial Intelligence Organization</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>5860</fpage>
          -
          <lpage>5868</lpage>
          . URL: https://doi.org/10.24963/ijcai.2023/650. doi:10.24963/ijcai.2023/650.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Choutas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bolkart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tzionas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <article-title>Collaborative Regression of Expressive Bodies using Moderation</article-title>
          ,
          <source>in: International Conference on 3D Vision (3DV)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Moon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Accurate 3D Hand Pose Estimation for Whole-Body 3D Human Mesh Estimation</article-title>
          ,
          <source>in: Computer Vision and Pattern Recognition Workshop (CVPRW)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Shiratori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Joo</surname>
          </string-name>
          ,
          <article-title>FrankMocap: A Monocular 3D Whole-Body Pose Estimation System via Regression and Integration</article-title>
          ,
          <source>in: IEEE International Conference on Computer Vision Workshops</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>PyMAF-X: Towards well-aligned full-body model regression from monocular images</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pavlakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Choutas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ghorbani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bolkart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. A.</given-names>
            <surname>Osman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tzionas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <article-title>Expressive Body Capture: 3D Hands, Face, and Body from a Single Image</article-title>
          ,
          <source>in: Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Baek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. I.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-K.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Pushing the envelope for RGB-based dense 3D hand pose estimation via neural rendering</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1067</fpage>
          -
          <lpage>1076</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Boukhayma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. d.</given-names>
            <surname>Bem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Torr</surname>
          </string-name>
          ,
          <article-title>3D hand shape and pose from images in the wild</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>10843</fpage>
          -
          <lpage>10852</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tzionas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <article-title>Embodied hands: modeling and capturing hands and bodies together</article-title>
          ,
          <source>ACM Trans. Graph</source>
          .
          <volume>36</volume>
          (
          <year>2017</year>
          ). URL: https://doi.org/10.1145/3130800.3130883. doi:10.1145/3130800.3130883.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Forster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Koller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bellgardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ney</surname>
          </string-name>
          ,
          <article-title>Extensions of the Sign Language Recognition and Translation Corpus RWTH-PHOENIX-Weather</article-title>
          ,
          <source>in: LREC</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1911</fpage>
          -
          <lpage>1916</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>