<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>EAM: Emotional Avatar Generation for the Metaverse</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ander González-Docasal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan Camilo Vásquez-Correa</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aitor Álvarez</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aritz Lasarguren</string-name>
          <email>aritz@baleuko.eus</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jone López</string-name>
          <email>jonelopez@baleuko.eus</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Egoitz Rodríguez</string-name>
          <email>egoitz@baleuko.eus</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aragon Institute for Engineering Research, University of Zaragoza</institution>
          ,
          <addr-line>Zaragoza</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Baleuko S.L.</institution>
          ,
          <addr-line>Durango</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Fundación Vicomtech, Basque Research and Technology Alliance (BRTA)</institution>
          ,
          <addr-line>Donostia - San Sebastián</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents Project EAM, a framework for generating and managing avatars with emotionally expressive speech synthesis and automatic facial animation. The system supports two Iberian languages, Spanish and Basque, and synthesises speech in four distinct emotional tones. Its built-in Speech Synthesis system, trained on professionally recorded emotional voices, is complemented by an automatic forced aligner that generates viseme sequences. By incorporating real-time facial animation prediction, it enhances lip-sync accuracy and extends its applicability to augmented reality, video games, and film production. The system allows for complete integration into the Unity 3-D engine, boosting the productivity of animators by reducing the time spent on demanding tasks. In conclusion, Project EAM contributes to bilingual digital content creation and advances expressive avatar animation for the Metaverse and interactive media production.</p>
      </abstract>
      <kwd-group>
        <kwd>Voice cloning</kwd>
        <kwd>Emotional Speech Synthesis</kwd>
        <kwd>Speech to Text Alignment</kwd>
        <kwd>Talking Face Generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The Metaverse represents a rapidly evolving digital ecosystem where users interact in immersive,
interconnected virtual environments. Enabled by advancements in virtual reality (VR), augmented
reality (AR), and artificial intelligence (AI), the Metaverse aims to reshape how people socialise, work,
and engage with digital content. As major industries invest in its development, the Metaverse is on the
verge of becoming a fundamental extension of real-world experiences, fostering new economic models,
communication methods, and creative expressions. Its relevance continues to grow as technological
infrastructure improves, making it a critical domain for research and innovation in human-computer
interaction and digital identity.</p>
      <p>
        Nowadays, numerous options exist for developing avatars for the Metaverse, with many solutions
available on the market. These options primarily focus on providing support for video games or
social networking platforms, such as VRChat [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, due to the specific requirements of these
environments, current solutions often fall into two extremes: either offering minimal personalisation
– where users are limited to human-like avatars – or providing an overwhelming number of
customisation options, which can result in inconsistent and incoherent virtual scenarios [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Furthermore, the
animation of avatars in these digital spaces is typically constrained by a predefined set of motions or
executed with a lack of precision, particularly evident in facial movements associated with speech and
emotional expression, such as lip and eyebrow articulation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Therefore, when specifically considering the domain of animation, there are relatively few options
available that effectively address the conveyance of emotional content – an essential factor in ensuring
a minimum standard of quality, since facial and body movements must be as faithful and expressive
as possible [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Additionally, many existing solutions rely on pre-recorded voice performances from
actors to drive the corresponding animations. In this context, avatars that incorporate synthetic speech
generation alongside synchronised lip movements present a valuable alternative, offering significant
benefits for animators and studios [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Such technology can be employed to create proofs of concept
or even short animated sequences before final in-studio recordings take place. Nevertheless, current
applications providing these services often generate audio that lacks expressiveness, emotional nuance,
or naturalness – essential components in animation, particularly in content designed
for a child audience. Moreover, the few tools that provide emotional speech synthesis often focus
on major languages such as English or Mandarin, with no option for communities with a more modest
number of speakers.
      </p>
      <p>In this context, we introduce EAM: Emotional Avatar generation for the Metaverse, a project
designed to facilitate the creation and management of avatars for both real-time interactions and
preprogrammed events. The system is capable of generating emotionally expressive speech from a
predefined script and automatically animating avatars based on textual or acoustic inputs. This solution
brings speech technologies to the emerging field of the Metaverse, while also extending their
applicability to well-established industries such as video game development and 3-D animation for films
and television series. Furthermore, it is designed to support real-time scenarios, enabling avatars to
produce highly accurate facial animations driven solely by an actor’s voice, with minimal latency.</p>
      <p>The solution is available in two Iberian languages, Spanish and Basque, thereby facilitating bilingual
content creation and broadening audience reach, while simultaneously contributing to the preservation
and promotion of regional minority languages. Moreover, the speech synthesis engine is capable of
generating audio using two distinct speaker voices, each available in four different emotions, ensuring
expressive and natural-sounding output in both supported languages.</p>
      <p>This solution has been successfully employed to generate short commercial clips for social media
engagement using predefined scripts. By automating the synchronisation of speech with mouth and
lip movements, it significantly enhances the productivity of animators, reducing the time required for
this otherwise labour-intensive task. These clips have been seamlessly produced in both Basque and
Spanish, effectively reaching a wider audience within a bilingual regional context.</p>
      <p>The project EAM has been partially funded by the Spanish Government’s 2022 call for the granting
of aid to R&amp;D projects in the audiovisual and video game fields, supported by the European Union’s
Next Generation funds. The project started in September 2022 and concluded in December 2024, and
was carried out by a consortium composed of the production company Baleuko (https://baleuko.es),
the professional recording studio Sonora (https://www.sonoraestudios.es/), and the Vicomtech
Foundation (https://www.vicomtech.org/).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Project goals and challenges</title>
      <p>The objective of EAM was to develop a system capable of managing avatars that autonomously
generate emotionally expressive animations from textual or acoustic inputs. This encompassed the
following key goals and challenges:
• Avatars should automatically generate synthetic emotional speech from a script. The
system must incorporate a Speech Synthesis engine that leverages emotional embeddings to
ensure accurate and expressive audio generation.
• Two different characters should be available, both speaking Spanish and Basque,
featuring four different emotions. The system should include one or more models capable of
generating every possible combination of these characteristics, ensuring a seamless
interpretation of any legible input string, particularly in terms of prosody and phoneme production.
• The generated speech should be accompanied by synchronised facial animation. The
system must produce a sequence of visemes (mouth shapes corresponding to each phoneme) that are
precisely aligned with the output audio, including accurate start and end timestamps.
• Live animation should be supported. The system must generate real-time mouth and lip
movements based on acoustic inputs, allowing for dynamic and responsive avatar animation.
• The system should be integrated with Unity (https://unity.com/). This 3-D engine enables
the development of a wide range of productions, including pre-rendered animations for films and
series, as well as real-time applications such as AR experiences and video games.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Speech synthesis engine</title>
      <p>The core component of the EAM solution is the Speech Synthesis engine. As previously stated, it must
be capable of generating speech for two different characters, expressing four different emotions, in
both Basque and Spanish.</p>
      <p>The development of this system began with its design process. Given that the primary application
of this project is the animation of children’s cartoons, the selection of emotions was based on the
characteristics typically found in such productions. To ensure clarity and expressiveness, two positive
and two negative emotions were chosen, with each further distinguished by varying levels of energy.
This classification not only enhances emotional depth but also prevents potential ambiguity between
categories, ensuring precise and consistent vocal expression. The selected emotions and their
corresponding features are shown in Table 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Data generation</title>
        <p>
          The next phase focused on generating the necessary data for training the Text-to-Speech (TTS) models.
Initially, the contents of multiple scripts from the producer company Baleuko were extracted. These
texts contained authentic examples from real shows in both Basque and Spanish, and therefore
provided a valuable foundation for fine-tuning the models to better reflect real-world usage. To further
expand the dataset and accommodate a greater volume of recordings, the corpus was supplemented
with public domain texts from Common Voice [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          Once the textual data was collected, a proprietary BERT-based [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] general-purpose sentiment
detection module was employed to classify each sentence and map its results into one of the predefined
emotional categories. As this emotion recognition tool was only available in Spanish, Basque texts
were first translated into Spanish using Itzuli, a high-quality Neural Machine Translation (NMT)
system accessible via API upon request (https://itzuli.vicomtech.org/api/), before undergoing emotional
classification, an approach that has demonstrated its robustness for Basque-Spanish translation [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
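<p>The translate-then-classify step can be sketched as follows. The functions below are toy stand-ins: the actual Itzuli NMT system and the proprietary BERT-based sentiment module are not public, and the sentiment-to-emotion mapping shown here is purely illustrative (the project's real category names appear in Table 1).</p>

```python
# Sketch of the sentiment-to-emotion labelling step for Basque text.
# translate_eu_es and classify_sentiment are HYPOTHETICAL stand-ins
# for the Itzuli NMT API and the BERT-based sentiment module.

SENTIMENT_TO_EMOTION = {
    # Illustrative mapping onto the four categories (two positive and
    # two negative, at different energy levels); real names are in Table 1.
    "very_positive": "positive_high_energy",
    "positive": "positive_low_energy",
    "negative": "negative_low_energy",
    "very_negative": "negative_high_energy",
}

def translate_eu_es(sentence, lang):
    """Toy stand-in: Basque sentences go through NMT before classification."""
    return sentence if lang == "es" else "[es] " + sentence

def classify_sentiment(sentence_es):
    """Toy stand-in for the Spanish-only sentiment detector."""
    return "positive"

def emotion_for(sentence, lang):
    # Translate (if needed), classify, then map to an emotional category.
    return SENTIMENT_TO_EMOTION[classify_sentiment(translate_eu_es(sentence, lang))]

print(emotion_for("Kaixo, zer moduz?", "eu"))
```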
        <p>After categorisation, the script content was segmented into paragraphs to facilitate the reading task
for the professional voice actors recording the database. Each paragraph was assigned a category based
on the emotional classification of its constituent sentences, following heuristics designed to ensure
a balanced distribution across all classes. Finally, textual content belonging to Common Voice was
assigned a random emotion, as these sentences are mostly neutral or informative.</p>
        <p>The next step involved recording the textual corpus in a professional studio. Two characters from
Baleuko-produced shows were selected as the primary voices for this project: Ezki (F) and Mithy (M).
Their respective voice actors were provided with a pronunciation guide to minimise variability,
particularly in the articulation of questions and foreign words. Moreover, in order to reduce studio costs,
textual corrections were preferred over re-recording in cases where minor speech errors occurred. To
optimise dataset balance, greater emphasis was placed on recordings for the Neutral emotion, as it was
expected to be the most frequently used category. Conversely, the number of recordings for the Fear
emotion was reduced, given that it was not anticipated to play a major role in the intended applications.</p>
        <p>
          Once the recordings were completed, a forced alignment model based on Kaldi [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] was employed to
align phoneme-level timestamps for each language. Initially, session-level processing was conducted
using a beam size of 40 and a retry beam of 70 to ensure correct alignment across all audio files.
Following this, sentence-level alignment was performed with beam sizes of 1 and 2, discarding unaligned
segments under the assumption that they were insufficiently literal. Nevertheless, this filtering process
resulted in the removal of only 0.67 % of sentences from the final corpus. The total amount of material
comprising the final database can be found in Table 2.
        </p>
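<p>The sentence-level alignment pass with beam fallback and filtering can be sketched as below; kaldi_align is a hypothetical stand-in for the Kaldi forced aligner, with simulated success and failure behaviour.</p>

```python
# Sentence-level alignment with beam fallback and filtering (sketch).
# kaldi_align is a HYPOTHETICAL stand-in for the Kaldi-based aligner:
# it simulates success or failure instead of doing real alignment.

def kaldi_align(text, beam):
    """Return phoneme timestamps, or None when alignment fails."""
    if "???" in text:          # simulate audio that never aligns
        return None
    # Simulated behaviour: a wider beam aligns everything, while a
    # beam of 1 only copes with short, literal sentences.
    ok = beam >= 2 or 3 >= len(text.split())
    return [("ph", 0.0, 0.1)] if ok else None

def filter_corpus(texts):
    kept = []
    for text in texts:
        # Try beam size 1 first, then retry with beam size 2; discard
        # sentences that remain unaligned (assumed insufficiently literal).
        result = kaldi_align(text, beam=1) or kaldi_align(text, beam=2)
        if result is not None:
            kept.append(text)
    removed_pct = 100.0 * (len(texts) - len(kept)) / len(texts)
    return kept, removed_pct

texts = ["kaixo", "egun on denoi", "bat bi hiru lau bost", "??? garbled take"]
kept, removed = filter_corpus(texts)
print(removed)   # 25.0
```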
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Speech synthesis models</title>
        <p>
          The next phase involved training Speech Synthesis models with all the gathered data. Given the
substantial amount of recorded acoustic material, the well-established Tacotron 2 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] was selected as
the TTS framework. This architecture leverages an encoder-decoder with attention to generate Mel
spectrograms from an input character sequence. To enhance its emotional expressiveness, the models
were further improved by incorporating emotional embeddings, following an approach similar to that
of [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. While Cho et al. applied this technique in a multi-speaker setting, the approach taken in this
project was a multi-emotional framework instead. Consequently, four separate models were trained,
one for each combination of speaker and language, each incorporating an emotion encoder.
        </p>
        <p>Initially, a single model per speaker was trained, encompassing both languages and all four
emotional classes, using a sampling rate of 22,050 Hz. These models were fine-tuned from a pre-existing
model trained on a single Spanish speaker without prior knowledge of Basque or emotional nuances.
Each model was trained for 100 k steps with a batch size of 6, starting with an initial learning rate of 10<sup>−3</sup>,
exponentially decaying to 10<sup>−5</sup> after the first 50 k steps. The emotional embeddings were represented
as 16-dimensional vectors, trained alongside the network.</p>
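<p>One reading of this schedule (an assumption, since the exact decay curve is not specified) is to hold 10−3 for the first 50 k steps and then decay exponentially so that 10−5 is reached at step 100 k:</p>

```python
# Exponential learning-rate decay (sketch, under the stated assumption
# that the decay starts at step 50k and reaches 1e-5 at step 100k).

def learning_rate(step, lr0=1e-3, lr_final=1e-5, hold=50_000, total=100_000):
    if hold >= step:
        return lr0                          # constant warm phase
    frac = (step - hold) / (total - hold)   # 0 at step 50k, 1 at step 100k
    return lr0 * (lr_final / lr0) ** frac   # geometric interpolation

print(learning_rate(0))         # 0.001
print(learning_rate(75_000))    # ~1e-4 (geometric midpoint)
print(learning_rate(100_000))   # ~1e-5
```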
        <p>Following this, a separate model was trained for each language, using only audio samples with a
maximum duration of 10 seconds. This restriction allowed for a higher batch size of 32. These models
underwent an additional 50 k training steps with an initial learning rate of 3 ⋅ 10<sup>−4</sup>.</p>
        <p>
          Finally, a HiFi-GAN [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] vocoder, responsible for converting the outputs of the TTS
module into audio waveforms, was trained per speaker using the ground-truth aligned
spectrograms generated by the final Tacotron 2 models. Each was trained with an initial learning rate of 2 ⋅ 10<sup>−4</sup> and a batch size of 16,
maintaining the 22,050 Hz sampling rate for consistency with the synthesis models. The resulting four
models, along with their respective vocoders, were shown to successfully synthesise speech across both
languages and all four emotional categories.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Inference pipeline</title>
        <p>The final component of the Speech Synthesis engine corresponds to its inference pipeline. It consists
of the following components:
• Text Preprocessor: This module processes the input text, converting special characters such
as numerical values into their fully readable form. It supports the two main languages of this
project: Spanish and Basque.
• Phonemiser: It receives the normalised text from the Text Preprocessor and converts it into its
phonetic representation. Given the relatively phonetic orthographies of the supported languages,
this module operates using a rule-based approach.
• TTS system: This component loads the four trained models alongside their respective vocoders
into RAM, ready to generate an audio signal from any incoming textual input and its
desired emotion.
• Forced aligner: Leveraging the aforementioned forced alignment module based on Kaldi, this
component aligns the phonetic transcription produced by the Phonemiser with the generated speech
output from the TTS system. It determines precise phoneme timestamps and maps them to their
corresponding visemes in the form of a JSON object.
• Orchestrator: This element serves as the central controller, managing the various pipelines
for the multiple combinations of languages and speakers. It ensures seamless integration of all
components, handling user requests and processing inputs and outputs in a manner agnostic to
the end user.</p>
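<p>As an illustration of the rule-based approach, a toy grapheme-to-phoneme pass for Spanish might look as follows. The rules shown are a small illustrative subset, not the project's actual rule set, which is more complete and also covers Basque.</p>

```python
# Toy rule-based grapheme-to-phoneme pass for Spanish (illustrative
# subset only; the project's real Phonemiser handles far more cases).

import re

RULES = [
    (re.compile(r"ch"), "tʃ"),
    (re.compile(r"ll"), "ʎ"),
    (re.compile(r"ñ"), "ɲ"),
    (re.compile(r"c(?=[ei])"), "θ"),   # peninsular Spanish "ce"/"ci"
    (re.compile(r"qu(?=[ei])"), "k"),
    (re.compile(r"j"), "x"),
    (re.compile(r"h"), ""),            # "h" is silent
    (re.compile(r"v"), "b"),
]

def phonemise(word):
    # Apply the rewrite rules in order, left to right over the word.
    out = word.lower()
    for pattern, repl in RULES:
        out = pattern.sub(repl, out)
    return out

print(phonemise("chico"))
```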
        <p>Each component is implemented as a RESTful API built in Go, facilitating modular communication via
JSON objects. The whole system takes a text and a desired emotion as input, subsequently generating
an audio file encoded as a base64 string or available for direct download. Additionally, a JSON object
providing detailed timing information for each viseme is returned. A schematic drawing of the whole
pipeline is shown in Figure 1.</p>
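<p>The end-to-end exchange can be sketched as a single JSON request and response; the field names below are illustrative, since the actual schema is not published.</p>

```python
# Sketch of the Orchestrator's JSON contract (illustrative field names;
# the audio and viseme values are hard-coded placeholders standing in
# for the TTS system and the forced aligner).

import base64
import json

def handle_request(request):
    # request: {"text": ..., "language": ..., "speaker": ..., "emotion": ...}
    audio_bytes = b"WAV bytes from the TTS system"
    visemes = [
        # One entry per viseme, with start/end timestamps in seconds.
        {"viseme": "A", "start": 0.00, "end": 0.12},
        {"viseme": "O", "start": 0.12, "end": 0.31},
    ]
    return {
        "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
        "visemes": visemes,
    }

response = handle_request({"text": "Kaixo!", "language": "eu",
                           "speaker": "Ezki", "emotion": "Neutral"})
print(sorted(response.keys()))   # ['audio_b64', 'visemes']
```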
        <p>[Figure 1: The Text and Emotion inputs flow through the Text Preprocessor, Phonemiser, TTS system, and Forced aligner.]</p>
        <p>This backend is then coupled with a graphical user interface (GUI) for automatic video sequence
generation, which permits the creation of fully customisable animations using the Speech Synthesis
engine. Using a configuration panel, the user is able to introduce text inputs along with their corresponding
emotional labels and predefined body movements. The frontend incorporates fine-grained camera
control for constructing cinematic sequences, which are then exported to tracks and clips inside Unity’s
timeline. An image of the resulting GUI integrated in Unity is shown in Figure 2.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Speech-driven avatar animation</title>
      <p>
        When generating facial animations from speech, it is important not only to ensure lip synchronisation,
but also to transfer the emotions and intent from the audio to the avatar [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Humans are highly
sensitive to facial expressions, making inconsistencies between speech and animation distracting and
confusing [14]. Therefore, high-fidelity animation becomes essential to improve user experience in
immersive environments.
      </p>
      <p>Speech-driven facial animation technologies fall into two broad categories based on complexity and
expressiveness. Some engines leverage large-scale neural models for highly nuanced animations [15, 16].
However, these solutions demand significant computational resources, limiting their suitability for
real-time processing during live events. In contrast, simpler libraries based on viseme
recognition [17] offer faster animation, but lack emotional expressiveness and intent transfer [18].</p>
      <sec id="sec-4-1">
        <title>4.1. Facial representation</title>
        <p>
          One of the core aspects of designing a speech-based facial animation engine is finding the most
appropriate facial representation that can be predicted from speech signals and transferred to the avatar.
Current systems are based on facial landmarks [18, 19] and 3D facial meshes [
          <xref ref-type="bibr" rid="ref13">13, 20, 16</xref>
          ], which are
able to produce high-fidelity and natural animations. However, they have limitations in computational
efficiency and software compatibility.
        </p>
        <p>For EAM, we considered a standard facial representation based on ARKit deformation blendshapes
(https://arkit-face-blendshapes.com/). The main reason is the integration with animation engines like
Unity, which already include features for rendering these types of representations and transferring them
to an avatar. ARKit’s blendshapes consist of 52 coefficients that model the movement of different parts
of the face, including the eyes, eyebrows, mouth, jaw, and nose.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Training data</title>
          <p>The facial animation model was trained using the CREMA-D corpus [21]. This is an emotional
multimodal acted dataset, traditionally used for speech emotion recognition. Actors uttered a selection of
12 sentences in six emotions (Anger, Disgust, Fear, Happiness, Neutral, and Sadness) and three levels
(Low, Medium, High), in English. This corpus has been used in similar studies, particularly in realistic
talking face generation using facial landmark representations [22, 23]. Despite the language difference,
as the principal objective of this model is to capture the main facial movements, it was deemed sufficient
for this task given the unavailability of existing data in the two target languages. Labels for blendshapes
were obtained from the facial videos of the actors using the MediaPipe blendshape V2 model [24].</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Facial animation model</title>
        <p>The facial animation engine comprises a neural architecture trained to predict a continuous stream of
blendshape coefficients from speech signals. The model relies on the combination of an
upstream model, frozen during training and aimed at extracting high-quality representations of the input
audio, and a downstream model trained to predict the 52 ARKit coefficients. A scheme of the trained
model is shown in Figure 3.</p>
        <p>[Figure 3: The WavLM Base+ upstream model (a CNN encoder followed by Transformer layers 1 to k) feeds the downstream blendshape predictor.]</p>
        <p>The WavLM Base+ model [25] extracts high-level representations from the input audio while being
light enough to be used in real-time scenarios. The hidden representations from each Transformer layer
are combined with a layer-wise pooling mechanism in order to obtain a single feature embedding per
frame. The layer weights were implemented with a learnable vector followed by a Softmax layer.
Finally, the sequence of feature embeddings is processed by a temporal convolution network to predict
the 52 blendshapes per frame.</p>
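<p>The layer-wise pooling can be sketched in NumPy: a per-layer weight vector passes through a softmax and forms a weighted sum over the hidden states of each Transformer layer. In the real model the weights are learnable; here they are a fixed vector.</p>

```python
# Layer-wise pooling over Transformer hidden states (NumPy sketch).
# In the real model the layer weights are trained with the network;
# here they are a fixed vector passed through a softmax.

import numpy as np

def layerwise_pool(hidden_states, layer_logits):
    # hidden_states: (num_layers, num_frames, dim) from the upstream model
    # layer_logits:  (num_layers,) learnable vector before the softmax
    w = np.exp(layer_logits - layer_logits.max())
    w = w / w.sum()                          # softmax over the layer axis
    # Weighted sum across layers yields one embedding per frame.
    return np.tensordot(w, hidden_states, axes=(0, 0))

layers, frames, dim = 12, 7, 768             # WavLM Base+ sizes (12 layers)
h = np.random.default_rng(0).normal(size=(layers, frames, dim))
pooled = layerwise_pool(h, np.zeros(layers)) # zero logits = uniform weights
print(pooled.shape)                          # (7, 768)
```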
        <p>For training, we employed the Smooth-L1 loss function and implemented a 5-fold speaker-independent
cross-validation strategy, using four folds for training and development, and the remaining one
for independent testing. The models were trained using the Adam optimiser, with a batch size of 32 audio
samples, a learning rate of 10<sup>−5</sup>, and a dropout of 0.1. The input audio corresponds to frames of 280 ms
(7 frames at 25 fps), predicting the blendshapes for the central frame while leveraging the rest as past and
future context.</p>
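<p>The 280 ms context window (7 frames at 25 fps, predicting the central frame) can be sketched as follows:</p>

```python
# Context framing sketch: 280 ms of 16 kHz audio covers 7 animation
# frames of 40 ms each; the model predicts blendshapes for the central
# frame, using the other 6 frames as past and future context.

SR = 16_000            # audio sampling rate in Hz
FPS = 25               # animation frame rate
FRAME = SR // FPS      # 640 samples per 40 ms animation frame
CONTEXT = 7            # frames per window (3 past + 1 central + 3 future)

def windows(num_samples):
    """Yield (start_sample, end_sample, central_frame) per full window."""
    total_frames = num_samples // FRAME
    for center in range(3, total_frames - 3):
        start = (center - 3) * FRAME
        yield start, start + CONTEXT * FRAME, center

one_second = list(windows(SR))     # windows over 1 s of audio
print(len(one_second))             # 19 central frames get a prediction
```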
        <p>The quality of the predictions was evaluated using the mean absolute error (MAE) between the
predicted and real blendshapes for different parts of the face. The average error in coefficient estimation
is 13.3 %. The mouth, jaw, and cheek areas were observed to be more accurately animated than the eyes or
eyebrows, as expected considering that our model is solely based on speech inputs. These results
outperformed those obtained in previous studies [26], where a 1D-CNN model was considered for the
same problem using the same data. The quality of the predicted blendshapes was also
perceptually evaluated using synthetic speech generated with the previously described TTS models.
In such cases, the avatar reflected accurate lip movements and the emotion present in the input audio.</p>
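<p>The evaluation can be sketched as a per-region MAE over the 52 blendshape coefficients; the region-to-coefficient grouping below is illustrative, not the actual ARKit index layout.</p>

```python
# Per-region mean absolute error over blendshape coefficients
# (NumPy sketch; the region-to-index grouping is illustrative).

import numpy as np

REGIONS = {
    # Illustrative index groups over the 52 coefficients.
    "mouth_jaw": list(range(0, 30)),
    "eyes_brows": list(range(30, 52)),
}

def region_mae(pred, true):
    # pred, true: (num_frames, 52) coefficient arrays in [0, 1]
    err = np.abs(pred - true)
    return {name: float(err[:, idx].mean()) for name, idx in REGIONS.items()}

rng = np.random.default_rng(0)
pred = rng.random((100, 52))
true = rng.random((100, 52))
print(region_mae(pred, true))
```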
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Real-time processing</title>
        <p>The facial animation model is integrated into a client-server application based on FFmpeg [27] and
WebSocket for real-time audio processing. The client transmits continuous audio streams in chunks of 1024 bytes
(32 ms of audio sampled at 16 kHz at 16-bit resolution) to the server, which receives and buffers the
stream. Once 280 ms of audio is stored, the server predicts the facial animation for the central frame
and sends the results back to the client for rendering in Unity. After processing, the server releases the
corresponding 40 ms audio segment and waits for new frames to arrive. An overview of the processing
setup is shown in Figure 4.</p>
        <p>This engine is further enhanced with a scenario for real-time events that integrates multiple
technologies, allowing the user fine control of the body animations of avatars. It includes the following
components:
• Rokoko motion capture: to control body animations leveraging a motion capture suit.
• ARKit face capture: to further enhance the facial animation of harder elements such as the eyes
and eyebrows.
• Control of animations from a handheld device: a video game controller is used to launch
predefined animations on the avatar.</p>
        <p>This model will continue to evolve once the project is finished, to ensure lower errors and smoother
real-time interaction.</p>
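<p>The server-side buffering described above can be sketched as follows; the WebSocket and FFmpeg plumbing is omitted, leaving only the byte arithmetic.</p>

```python
# Server-side audio buffering sketch (WebSocket/FFmpeg plumbing
# omitted). Chunks of 1024 bytes (32 ms at 16 kHz, 16-bit mono) are
# buffered; once 280 ms is available, a window is handed to the facial
# animation model and 40 ms (one animation frame) is released.

CHUNK = 1024                         # bytes per client message (32 ms)
WINDOW = 16_000 * 2 * 280 // 1000    # 280 ms of 16-bit audio = 8960 bytes
STEP = 16_000 * 2 * 40 // 1000       # release 40 ms (one frame) = 1280 bytes

class StreamBuffer:
    def __init__(self):
        self.buf = b""

    def feed(self, chunk):
        """Buffer a chunk; return one prediction window when ready."""
        self.buf += chunk
        if len(self.buf) >= WINDOW:
            window = self.buf[:WINDOW]
            self.buf = self.buf[STEP:]   # slide forward by one frame
            return window                # hand off to the animation model
        return None

sb = StreamBuffer()
ready = [sb.feed(b"\x00" * CHUNK) for _ in range(9)]
```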
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This work describes project EAM, a framework designed to facilitate the generation and management
of two avatars, enabling speech synthesis in four distinct emotional tones while automatically
animating corresponding mouth and lip movements. By automating and streamlining the traditionally
labour-intensive process of vocal animation, EAM significantly enhances the productivity of animators
and accelerates the creation of animated clips within the Unity 3-D engine.</p>
      <p>Furthermore, the solution is available in two Iberian languages – Spanish and Basque – expanding
the scope of the automatically generated content to a bilingual audience. Finally, the incorporation of
real-time prediction of facial animations permits the application of this tool to many more scenarios,
such as AR or video games.</p>
      <p>As future work, research on further fine-tuning of real-time blendshape detection algorithms will be
performed. Additionally, efforts will be made to explore novel AI techniques in order to transfer the
emotional capabilities of the TTS system to new unseen speakers or languages, reducing the costs of
scaling the solution to new speakers.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Project EAM was partially funded by the 2022 procedure for the granting of aid for business R&amp;D
projects applicable to the audiovisual and video game fields, and the transfer and adaptation of these
developments to other application sectors, within the framework of the Plan to Boost the
Audiovisual Sector Spain Audiovisual Hub of Europe; the “Digital Spain 2025” Agenda; the Recovery,
Transformation and Resilience Plan; and the State Plan for Scientific, Technical and Innovation Research
2021–2023.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT for grammar and spelling
checking. After using the tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
      <p>[14] A. Alsius, et al., Forty years after hearing lips and seeing voices: The McGurk effect revisited, Multisensory Research 31 (2018) 111–144. doi:10.1163/22134808-00002565.
[15] K. D. Yang, et al., Probabilistic speech-driven 3D facial motion synthesis: New benchmarks, methods, and applications, arXiv preprint arXiv:2311.18168 (2023). doi:10.48550/arXiv.2311.18168.
[16] Q. Zhao, et al., Media2Face: Co-speech facial animation generation with multi-modality guidance, arXiv preprint arXiv:2401.15687 (2024). doi:10.48550/arXiv.2401.15687.
[17] P. Edwards, et al., JALI: an animator-centric viseme model for expressive lip synchronization, ACM Transactions on Graphics (TOG) 35 (2016) 1–11. doi:10.1145/2897824.2925984.
[18] S. Taylor, et al., A deep learning approach for generalized speech animation, ACM Transactions on Graphics (TOG) 36 (2017) 1–11. doi:10.1145/3072959.3073699.
[19] A. Vidal, C. Busso, Multimodal attention for lip synthesis using conditional generative adversarial networks, Speech Communication 153 (2023) 102959. doi:10.1016/j.specom.2023.102959.
[20] B. Thambiraja, et al., 3DiFACE: Diffusion-based speech-driven 3D facial animation and editing, arXiv preprint arXiv:2312.00870 (2023). doi:10.48550/arXiv.2312.00870.
[21] H. Cao, et al., CREMA-D: Crowd-sourced emotional multimodal actors dataset, IEEE Transactions on Affective Computing 5 (2014) 377–390. doi:10.1109/TAFFC.2014.2336244.
[22] K. Vougioukas, et al., Realistic speech-driven facial animation with GANs, International Journal of Computer Vision 128 (2020) 1398–1413.
[23] T. Kefalas, et al., Speech-driven facial animation using polynomial fusion of features, in: Proc. ICASSP, IEEE, 2020, pp. 3487–3491. doi:10.1109/ICASSP40776.2020.9054469.
[24] I. Grishchenko, et al., MediaPipe blendshape V2 model card, 2022. URL: https://storage.googleapis.com/mediapipe-assets/Model%20Card%20Blendshape%20V2.pdf.
[25] S. Chen, et al., WavLM: Large-scale self-supervised pre-training for full stack speech processing, IEEE Journal of Selected Topics in Signal Processing 16 (2022) 1505–1518.
[26] J. C. Vásquez-Correa, et al., Real-time speech-driven avatar animation by predicting facial landmarks and deformation blendshapes, in: Proceedings of ICNLSP, 2024, pp. 109–118.
[27] S. Tomar, Converting video formats with FFmpeg, Linux Journal 2006 (2006) 10.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jehma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Akaraphattanawong</surname>
          </string-name>
          ,
          <article-title>VRChat as a virtual learning platform for enhancing English listening skills</article-title>
          ,
          <source>International Journal of Information and Education Technology</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>813</fpage>
          -
          <lpage>817</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <article-title>Factors affecting avatar customization behavior in virtual environments</article-title>
          ,
          <source>Electronics</source>
          <volume>12</volume>
          (
          <year>2023</year>
          )
          <fpage>2286</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Noisri</surname>
          </string-name>
          , et al.,
          <article-title>Designing avatar system and integrate to the metaverse</article-title>
          , in:
          <source>2024 International Technical Conference on Circuits/Systems, Computers, and Communications (ITC-CSCC)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.-M.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-G.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>A metaverse: Taxonomy, components, applications, and open challenges</article-title>
          ,
          <source>IEEE Access</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>4209</fpage>
          -
          <lpage>4251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Abdelaziz</surname>
          </string-name>
          , et al.,
          <article-title>Audiovisual speech synthesis using Tacotron2</article-title>
          ,
          <source>in: Proc. International Conference on Multimodal Interaction</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>503</fpage>
          -
          <lpage>511</lpage>
          . doi:10.1145/3462244.3479883.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ardila</surname>
          </string-name>
          , et al.,
          <article-title>Common voice: A massively-multilingual speech corpus</article-title>
          ,
          <source>in: Proceedings of LREC, European Language Resources Association</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>4218</fpage>
          -
          <lpage>4222</lpage>
          . URL: https://aclanthology.org/2020.lrec-1.520/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , et al.,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proc. Conference of the North American Chapter of the ACL</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Etchegoyhen</surname>
          </string-name>
          , et al.,
          <article-title>Neural machine translation of Basque</article-title>
          ,
          <source>in: Proceedings of the European Association for Machine Translation</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>168</lpage>
          . URL: https://aclanthology.org/2018.eamt-main.14.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          , et al.,
          <article-title>The Kaldi speech recognition toolkit</article-title>
          ,
          <source>in: Proceedings of ASRU, IEEE Signal Processing Society</source>
          ,
          <year>2011</year>
          . IEEE Catalog No.: CFP11SRW-USB.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          , et al.,
          <article-title>Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions</article-title>
          ,
          <source>in: Proceedings of ICASSP</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4779</fpage>
          -
          <lpage>4783</lpage>
          . doi:10.1109/ICASSP.2018.8461368.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cho</surname>
          </string-name>
          , et al.,
          <article-title>Learning speaker embedding from text-to-speech</article-title>
          ,
          <source>in: Interspeech 2020</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>3256</fpage>
          -
          <lpage>3260</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kong</surname>
          </string-name>
          , et al.,
          <article-title>HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>17022</fpage>
          -
          <lpage>17033</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Expressive speech-driven facial animation with controllable emotions</article-title>
          ,
          <source>arXiv preprint arXiv:2301.02008</source>
          (
          <year>2023</year>
          ). doi:10.1109/ICMEW59549.2023.00073.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>