<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Tailoring Furhat robotic head lip-syncing to Galician language: an adaptation and evaluation study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carla Castedo</string-name>
          <email>carlacastedo.pereira@usc.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carmen Magariños</string-name>
          <email>mariadelcarmen.magarinos@usc.gal</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro Catala</string-name>
          <email>alejandro.catala@usc.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Bugarín-Diz</string-name>
          <email>alberto.bugarin@usc.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS)</institution>
          ,
          <addr-line>Universidade de Santiago de Compostela</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Departamento de Electrónica e Computación, Universidade de Santiago de Compostela</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Instituto da Lingua Galega, Universidade de Santiago de Compostela</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Text-speech alignment and lip-syncing are crucial for a pleasant interaction with an embodied conversational agent, especially for anthropomorphic social robots like Furhat. While pre-trained alignment models are integrated into these agents, they may not align the target language accurately, as they are primarily trained on English. This paper addresses this limitation for Galician, leveraging a previously developed text-to-speech system, the Furhat robot, and the Montreal Forced Aligner (MFA). We create acoustic models and a pronunciation dictionary for Galician from scratch, which is a key contribution given the lack of resources. We propose an alternative method using MFA to generate accurate phone-level alignments for Galician synthetic speech and evaluate its quality through objective and subjective experiments. In this preliminary study, our trained model's accuracy in misalignment assessment matches results reported in the literature for other languages, despite the limited data availability for Galician. Regarding the subjective evaluation, a perceptual test with native speakers reveals a strong preference (88%) for our lip synchronization over Furhat's default (2%), highlighting the validity of our method for improving lip-syncing in under-resourced languages.</p>
      </abstract>
      <kwd-group>
        <kwd>lip-syncing</kwd>
        <kwd>forced alignment</kwd>
        <kwd>social robot</kwd>
        <kwd>Furhat</kwd>
        <kwd>Galician language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The advent of artificial intelligence has led to significant advancements in the field of social robots
[
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ], which are characterized by their ability to interact and communicate with humans. The
evolution of language technologies, particularly speech technologies, has been crucial in this progress,
enabling robots to interact more naturally and effectively. These technologies include automatic speech
recognition (ASR), text-to-speech synthesis (TTS), and dialogue systems, which empower robots to
understand and respond to verbal commands, as well as engage in fluid conversations. Furthermore,
many of these robots incorporate multimodal interaction capabilities, such as face recognition and face
tracking systems, and enrich communication with the user through gestures and facial expressions.
      </p>
      <p>
        Furhat [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is a prime example of a cutting-edge social robot that has the capacity to modify its facial
appearance and expressions through its innovative projected mask design. This technology enables it
to interact with multiple people simultaneously in a multimodal approach, using verbal and non-verbal
cues like speech, real-time face tracking, facial analysis, lip-synced facial animation, gestures, and eye
and head movements, creating mixed-initiative conversations. In addition, Furhat integrates the most
advanced ASR and TTS systems, supporting more than 40 languages and over 200 different voices.
      </p>
      <p>However, to ensure a fully satisfying experience when interacting with social robots that feature
human-like faces, attention must be paid to both the quality of synthesized speech and the
synchronization of facial expressions. In particular, accurate lip synchronization is crucial for natural and fluid
robot-human interaction, as it directly impacts the perceived intelligibility and expressiveness of the
robot’s speech. This synchronization must be carefully aligned with the prosody of the synthesized
audio and the specific phonetic features of the target language.</p>
      <p>In this regard, Furhat’s lip synchronization faces limitations due to its reliance on the Microsoft
Universal Phone Set (UPS)1, which is primarily designed for American English. This means Furhat
cannot accurately reproduce lip movements for phones absent in the UPS, which is a significant
issue when working with other languages. Moreover, integrating external TTS systems that lack
the necessary information for lip synchronization (i.e. phone sequence and timestamps) presents an
additional challenge. In such cases, Furhat defaults to automatic lip-syncing based on phonetic
recognition of the audio. This automatic process, while functional, is suboptimal, as the built-in phone recognizer
is trained on English. Therefore, for languages other than English, the resulting lip movements are
frequently inaccurate, further diminishing the naturalness and realism of Furhat’s speech.</p>
      <p>
        This work addresses the limitations of Furhat’s default lip synchronization when integrating an
external TTS system in the Galician language. We propose an alternative method that leverages the
Montreal Forced Aligner (MFA) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a robust forced alignment tool based on a classical Hidden Markov
Model-Gaussian mixture model (HMM-GMM) architecture, to generate accurate phone-level alignments
for Galician speech. We evaluate this method using the Celtia voice of the Nós-TTS system 2 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], a
high-quality Galician synthetic voice developed within the Nós project [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. A key challenge overcome
in this work is the creation of acoustic models and a pronunciation dictionary for Galician, as, to the
best of our knowledge, no pre-existing resources were available. This study aims to bridge this gap,
enabling the integration of MFA with Furhat and evaluating its performance in achieving accurate
lip synchronization for Galician speech and improving user perception. Evaluation will be performed
through objective measures and a perceptual preference test conducted by native speakers.
      </p>
      <p>The remainder of this paper is structured as follows: Section 2 reviews background on Furhat’s
lip-syncing and forced alignment; Section 3 outlines the proposed system architecture; Section 4 details
the employed methodology; Section 5 presents the evaluation results; and, finally, Section 6 discusses
the conclusions and future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. Furhat’s lip-syncing</title>
        <p>Furhat’s operation relies on Kotlin-based skills, which manage both speech output and facial expressions.
Lip synchronization is achieved through the use of paired audio and alignment text files. Specifically,
16 kHz 16-bit PCM WAV audio files are used, accompanied by corresponding Furhat-specific
JSON-formatted text files (.pho), which provide precise word- and phone-level alignments. Additionally,
each phone in a .pho file has an associated Boolean “prominent” field. This field is used to trigger a
MonitorSpeechProminent event, which can be used for co-speech gestures like raising the eyebrows.
However, the company does not provide specific code or criteria for its implementation.</p>
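        <p>As a rough illustration, the snippet below builds and writes a .pho-style structure in Python. The field names and timings are assumptions based on the description above (word and phone entries with timestamps plus a Boolean prominent flag), not Furhat's exact proprietary schema.</p>

```python
import json

# Hypothetical .pho-style alignment file: word- and phone-level entries with
# timestamps (seconds) and the Boolean "prominent" field described above.
# Field names are illustrative assumptions, not Furhat's documented schema.
pho = {
    "words": [
        {"word": "ola", "start": 0.00, "end": 0.42},
    ],
    "phones": [
        {"phone": "o", "start": 0.00, "end": 0.18, "prominent": True},
        {"phone": "l", "start": 0.18, "end": 0.30, "prominent": False},
        {"phone": "a", "start": 0.30, "end": 0.42, "prominent": False},
    ],
}

with open("ola.pho", "w", encoding="utf-8") as f:
    json.dump(pho, f, indent=2, ensure_ascii=False)
```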
        <p>In Furhat’s speech generation process, the TTS system is expected to provide a .pho file for accurate
lip synchronization. If this file is missing, Furhat employs an automatic lip-sync mechanism based on
phonetic recognition of the input audio. While functional across languages, this automatic process is
optimized for American English, as its underlying model is primarily trained on English data.
Consequently, lip-syncing for other languages may be suboptimal, resulting in unnatural or inaccurate
lip movements that do not correspond to the actual pronunciation. Therefore, to achieve accurate
lip-syncing in other languages, a custom phonetic file that matches Furhat’s requirements must be
generated. For this purpose, a forced alignment tool may be used. The selected aligner should support
phone-level alignment, and the resulting phonetic transcriptions must be mapped to the Microsoft UPS
to ensure compatibility with Furhat.</p>
        <p>Furhat Robotics also provides a web application3 that uses the MFA for alignment of American
English. Although this tool is not integrated for real-time operation with the robot, it has been used in
this work to analyze the structure of the .pho files generated by Furhat.</p>
        <p>1 As of 2024-06-30, when this study was conducted. 2 https://tts.nos.gal 3 https://furhat.io/audio</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Forced alignment</title>
        <p>
          Forced alignment techniques can be categorized into three main approaches: HTK-based, Kaldi-based,
and deep learning-based methods. The Hidden Markov Model Toolkit (HTK) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is an older yet widely
used framework, powering aligners like Prosodylab-Aligner [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and MAUS [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. While these offer
robust performance, they have limited flexibility and a steep learning curve. Kaldi-based [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] methods,
exemplified by the MFA [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], overcome these limitations by offering a more user-friendly interface, active
development, and more advanced acoustic modelling techniques. Deep learning-based methods, such
as Wav2Vec 2.0 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], employ self-supervised learning and Connectionist Temporal Classification (CTC)
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] to achieve high accuracy, often matching the performance of traditional methods [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
          ]. However,
they typically require pre-existing alignments for training. Although NeMo Forced Aligner (NFA) [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]
has recently emerged as a promising alternative, it does not provide phone-level alignment, making it
unsuitable for Furhat’s needs. Ultimately, MFA was chosen for this study due to its high accuracy [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ],
adaptability, noise resistance, ease of use, and comprehensive documentation.
        </p>
        <p>MFA employs a traditional architecture that combines HMMs for sequence modelling with GMMs
for acoustic modelling. The training process involves extracting Mel-Frequency Cepstral Coefficients
(MFCC) from the audio and training phone models (monophones and triphones) using the
Expectation-Maximization (EM) algorithm. To improve accuracy, speaker adaptation techniques, such as Linear
Discriminant Analysis with a Maximum Likelihood Linear Transform (LDA+MLLT) and Speaker
Adaptive Training (SAT) with Feature-space Maximum Likelihood Linear Regression (fMLLR), are
applied. For alignment, the audio, the corresponding orthographic transcription, and a pronunciation
dictionary are required. MFA supports the custom training of acoustic models and pronunciation
dictionaries, facilitating the inclusion of new languages like Galician.</p>
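        <p>In practice, this MFA workflow reduces to a few command-line calls. The commands below are an illustrative sketch (corpus and file paths are placeholders, not the exact resources used in this work):</p>

```shell
# Validate the corpus against the dictionary (this also reports OOV words),
# then train a Galician acoustic model and align the corpus to TextGrids.
# All paths are placeholders.
mfa validate ~/corpus_gl galician_dict.txt
mfa train ~/corpus_gl galician_dict.txt galician_acoustic.zip
mfa align ~/corpus_gl galician_dict.txt galician_acoustic.zip ~/aligned_textgrids
```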
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. System overview</title>
      <p>Our proposed architecture places a proxy service between the Furhat robot and the external
TTS. In the default configuration, the proxy service responds only to the audio request. Therefore, the
robot generates the .pho file internally. When using our proposed alignment method, the proxy service
requests the .pho file from the alignment service, sending both the input text and the synthesized
audio. The resulting .pho file is then sent to the robot, enabling enhanced lip synchronization. Our
system requires a custom-built MFA module, including all necessary resources for forced alignment.
Furthermore, a dedicated program, or “skill”, must be running on the Furhat robot to manage these
tasks.</p>
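      <p>A minimal sketch of the data the proxy service would forward to the alignment service is shown below; the endpoint name and payload fields are assumptions for illustration, not the actual service API.</p>

```python
import json

def build_alignment_request(text: str, wav_bytes: bytes) -> dict:
    """Package the inputs the alignment service needs to produce a .pho file:
    the orthographic text and the synthesized 16 kHz 16-bit PCM WAV audio.
    The endpoint route and field names here are hypothetical."""
    return {
        "endpoint": "/align",          # hypothetical service route
        "text": text,                  # input text for forced alignment
        "audio_size": len(wav_bytes),  # size of the WAV payload in bytes
    }

req = build_alignment_request("Ola, mundo", b"\x00" * 32000)
print(json.dumps(req))
```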
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Pronunciation dictionary construction</title>
        <p>
          The pronunciation dictionary was created using the established format for MFA dictionaries. Words
and conjugated verb forms were drawn from the Real Academia Galega dictionary [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and the Instituto
da Lingua Galega pronunciation dictionary [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
        <p>
          In addition, when validating a speech corpus with MFA, a list of out-of-vocabulary words (OOV)
is obtained. These words belong to the corpus but are not included in the dictionary. In order to
avoid them, the OOVs of all speech corpora that will be used later for the training of acoustic models
[
          <xref ref-type="bibr" rid="ref21 ref22 ref23 ref24 ref25">21, 22, 23, 24, 25</xref>
          ] were included in the pronunciation dictionary. These words were normalized and
phonetically transcribed using the text processing module of Cotovía [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], which serves as a
grapheme-to-phoneme (G2P) system for the Galician language. Cotovía outputs use a special phone set designed
for the tool. For simplicity, it was taken as the dictionary phone set. It must be noted that pronunciation
probabilities were not considered, as the goal was the creation of a functional basic dictionary.
        </p>
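        <p>For reference, MFA pronunciation dictionaries are plain text files with one word per line followed by its space-separated phones; since pronunciation probabilities were not used, each line holds a single pronunciation. A minimal sketch (with illustrative, not actual, transcriptions):</p>

```python
# Write a toy MFA-style dictionary: one word per line, a tab, then the
# space-separated phones. The phone symbols below are illustrative
# placeholders, not the actual Cotovía phone set used in this work.
entries = {
    "casa": ["k", "a", "s", "a"],
    "leña": ["l", "e", "J", "a"],  # "J" is a placeholder for the palatal nasal
}

with open("galician_dict.txt", "w", encoding="utf-8") as f:
    for word, phones in entries.items():
        f.write(f"{word}\t{' '.join(phones)}\n")
```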
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Acoustic model training</title>
        <p>
          A custom acoustic model was trained to enable MFA to perform forced alignment in Galician. This
process involved utilizing over 1,700 hours of speech data from various corpora. Due to the MFA’s
database size constraints, it was infeasible to train a model using the complete dataset. Therefore,
several combinations of different corpora were tried for training. The best results were obtained
with the combination of Nós_Celtia-GL [
          <xref ref-type="bibr" rid="ref22 ref27">22, 27</xref>
          ], Common_Voice_GL_17.0 [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], Telexornais_LS (internal
use) and OpenSLR77 [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], with a total of 216 hours of audio and more than 4,000 different speakers as
shown in Table 1. The best results were determined based on the validation set described in Section 4.4
and the metrics detailed in Section 5.1.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Furhat’s phonetic file construction</title>
        <p>
          Because Furhat requires specific phonetic files for lip synchronization, the TextGrid format output
by MFA, containing word- and phone-level alignments, must be converted to Furhat’s .pho format.
Furthermore, as previously mentioned, Furhat’s phone articulation relies on the UPS. Although Furhat
allows recording new gestures, it does not support modifications to its articulation phone set.
Consequently, to generate a functional .pho file, a mapping is required between the Cotovía phone set (used
for alignments) and the Furhat phone set. This mapping is performed via the International Phonetic
Alphabet (IPA) [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], as detailed in Table 2. Due to incomplete equivalence between the phone sets,
some phone approximations (highlighted in bold red) were made with the assistance of a phonetics
expert. As for the prominent field, given the lack of documented criteria, we marked the tonic (stressed)
syllable of each word as prominent.
        </p>
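        <p>The conversion step can be sketched as follows: each phone interval from the MFA TextGrid is relabeled through a Cotovía-to-UPS table and emitted as a timed entry. The mapping entries below are hypothetical placeholders; the actual expert-reviewed mapping is the one in Table 2.</p>

```python
# Hypothetical subset of a Cotovía -> Microsoft UPS phone mapping (via IPA).
# The real mapping used in this work is the expert-reviewed one in Table 2.
COTOVIA_TO_UPS = {"a": "AA", "e": "EH", "s": "S"}

def to_pho_entries(intervals):
    """Convert (cotovia_phone, start_s, end_s) TextGrid intervals into
    Furhat-style timed phone entries, mapping symbols where possible."""
    out = []
    for phone, start, end in intervals:
        out.append({
            "phone": COTOVIA_TO_UPS.get(phone, phone),  # fall back to input symbol
            "start": round(start, 3),
            "end": round(end, 3),
        })
    return out

print(to_pho_entries([("a", 0.0, 0.18), ("s", 0.18, 0.31)]))
```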
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Evaluation</title>
        <p>
          In order to check whether the trained MFA model was suitable for the alignment of synthetic speech,
a set of five sentences (included in Appendix A) was designed, considering particularly challenging
articulations. These sentences, synthesized using the Celtia voice of the Nós-TTS system and manually
aligned using Praat [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], will serve as a gold standard.
        </p>
        <p>Using this gold standard as a reference, we performed an objective evaluation considering alignment
errors between the timestamps predicted by the trained model and the manually annotated timestamps
(at both word and phone levels). We calculated these errors’ mean, standard deviation, and median to
assess model performance and compare our results with those reported in the literature. The selected
model is the one with the lowest error statistics, trained on the corpora specified in Section 4.2.</p>
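        <p>The error statistics described above can be computed as in the sketch below, where boundary errors are the absolute differences between predicted and gold-standard timestamps (the numbers are made-up examples, not the study's data):</p>

```python
from statistics import mean, median, stdev

# Illustrative gold-standard vs. predicted boundary times in seconds;
# these values are made-up examples, not the study's measurements.
gold = [0.10, 0.35, 0.62, 0.90]
pred = [0.12, 0.33, 0.70, 0.91]

errors = [abs(p - g) for p, g in zip(pred, gold)]
print(f"mean={mean(errors):.3f} median={median(errors):.3f} std={stdev(errors):.3f}")
```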
        <p>After validation, the alignment model was integrated into the proposed system. A subjective
evaluation, by means of a perceptual preference test, was then conducted to compare the perception of
the default lip-syncing and that obtained with our system. The test consisted of 10 pairs of stimuli,
each corresponding to a diferent sentence. Within each pair, one stimulus featured Furhat’s default
lip-syncing, while the other showcased the synchronization generated by the proposed method. The
stimuli in each pair were presented in random order to the participants. The selection of the 10 sentences,
which comprised the 5 sentences from the objective evaluation, was guided by a phonetics expert. The
sentences were designed to cover a range of expressiveness, including declarative, interrogative, and
exclamatory forms. For each pair, participants were asked to indicate their preference by answering the
question “Which of the stimuli is more natural and synchronized with the audio?”. The answer options
were: “Stimulus A”, “Stimulus B” and “I can’t decide”. Participants were allowed to ask for the stimulus
of their choice to be replayed as many times as they wished. It is worth mentioning that participants
experienced the stimuli4 directly through the robotic head. The total task duration was approximately
20 minutes, and the test was conducted in a room free from noise interference.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Automatic misalignment assessment</title>
        <p>The evaluation of the Galician MFA-trained acoustic model was conducted by comparing its
automatically generated alignments against the gold standard. Key metrics, including mean, median, and
standard deviation, were calculated for both word- and phone-level alignment errors and are shown in
Table 3.</p>
        <p>
          At the word level, it can be observed that the results are consistent with those obtained in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] in
terms of mean, median and standard deviation. In that work, the authors used pre-trained acoustic
models from the official MFA website. Therefore, the fact that our trained Galician model replicates
their results suggests it performs comparably to other models provided by MFA in their acoustic model
bank.
4 Video recordings illustrating how the stimuli looked are available at https://nextcloud.citius.usc.es/s/TNpGESXxnq3bmzB.
        </p>
        <p>
          Concerning the errors at the phone level, our results are comparable to those reported by the authors
of MFA [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] in terms of mean and median. This represents a significant achievement, considering the
greater availability of high-quality, large-scale corpora for American English. The close results, despite
limited access to corpora, suggest strong model performance.
        </p>
        <p>It is worth remarking on the difference between the mean and median values, especially at the word
level, as depicted in the boxplots in Figure 2a. As can be observed, while most data points cluster around
the median for both word- and phone-level errors, a couple of outliers, deviating by more than one
second, substantially increase the mean. These outliers, located around the same point in the audio, indicate
that this large difference in phone alignment accumulates at the word level. Further investigation of the
corresponding audio files revealed that these outliers stem from silence misalignments.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Perceptual preference test</title>
        <p>Five adult Galician native speakers, all women aged 20-40 with normal or corrected-to-normal vision,
participated in the test. Four of them worked in language technologies, with backgrounds in fields such
as linguistics, speech technologies, and computational linguistics. Three of these had prior experience
with either Galician speech synthesis, the Furhat robot, or both. The fifth participant was completely
unfamiliar with the technologies under consideration. Figure 2b shows the obtained preference scores
for the default Furhat synchronization, the trained model, and the no-preference option,
along with the corresponding 95% confidence intervals. Trained model synchronization was chosen in
88% of the cases, while in only 2% of them, the default Furhat synchronization was chosen. Moreover,
confidence intervals do not overlap, revealing significant differences, even with such a small number of
participants.</p>
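        <p>One common way to obtain such intervals is the normal-approximation confidence interval for a proportion; the paper does not state the exact method used, so the sketch below is only indicative (e.g. 44 of 50 judgments favouring the proposed method corresponds to the reported 88%):</p>

```python
from math import sqrt

def ci95(successes: int, n: int) -> tuple:
    """Normal-approximation 95% confidence interval for a proportion,
    clipped to [0, 1]."""
    p = successes / n
    half = 1.96 * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

lo, hi = ci95(44, 50)  # e.g. 5 participants x 10 pairs = 50 judgments
print(f"p=0.88, 95% CI = [{lo:.3f}, {hi:.3f}]")
```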
        <p>
          For 8 out of the 10 sentences, the preference for the trained model was unanimous. In only one
sentence did most participants not choose the trained model. This sentence is also one
of those considered in the previous objective evaluation, and it is the one that generated the outliers
in the boxplots. When the participants were asked about this sentence, they stated that they were
puzzled to see the robot moving its lips in a moment of silence. Thus, improved management of silences
during model training could further increase the preference for the trained model. As reported by
other authors in the literature [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], the lack of explicitly encoded silences in transcriptions poses a
challenge for forced alignment, often leading to misalignment during long silent intervals. This issue
could be mitigated by using the pause information provided by Cotovía to encode long silences in the
transcription.
        </p>
        <p>In addition, the phonetics expert was asked to provide her overall assessment of the synchronization.
She stated that she perceived a significant improvement in the phone articulation when using the
proposed method and even noted an increase in the robot’s facial expressiveness. While further
investigation is necessary, this enhancement in expressiveness may be related to the influence of the
prominent field.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work we have explored an alternative form of lip-syncing for the social robot Furhat in its
Galician speech by using Nós-TTS and forced alignment tools. Specifically, MFA was used to train an
acoustic model for Galician completely from scratch, as well as build the pronunciation dictionary.
</p>
      <p>Figure 2: (a) Alignment errors at word and phone level; (b) perceptual preference test results.</p>
      <p>Automatic misalignment assessment demonstrates that our trained model is able to match, at the
word level, the results obtained by pre-trained models. At the phone level, its performance is close to
that reported for American English by the MFA developers.</p>
      <p>The perceptual preference test shows the positive results achieved by the system with the integrated
trained model. Participants clearly perceived better lip synchronization with the integrated system, and
adjusting the fields of the phonetic file made it possible to gain expressiveness in facial expressions.
These results are promising and indicate the feasibility of further formal system validation with more
participants. Additionally, it would be interesting to conduct a more detailed evaluation of the robot's
expressiveness associated with lip-syncing in particular social applications, where validated questionnaires
or biometric measures (e.g., eye-tracking, galvanic skin response) could offer more objective and
in-depth insights into the interactions.</p>
      <p>In light of the encouraging initial results reported in this paper, we intend to expand the scope of
this study. Future work could evaluate the performance obtained by a probability-based pronunciation
dictionary and explore alternative forced alignment systems, considering more recent deep
learning-based models. Furthermore, we plan to enhance the implementation and deployment of the developed
method to ensure lower latency, making it suitable for non-scripted scenarios outside the lab. Finally,
given the importance of expressiveness, future studies could investigate alternative methods for defining
the prominent field in order to achieve more natural and realistic gesticulation.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was funded by the Ministry for Digital Transformation and Civil Service and the Recovery,
Transformation and Resilience Plan - Funded by EU – NextGenerationEU within the framework of the
projects “Desarrollo Modelos ALIA” and NEL-NÓS (ref. 2022/TL22/00215336). This research is also
supported by projects PID2020-112623GB-I00, PID2021-123152OB-C21, and CNS2024-154915 funded
by MCIN/AEI/10.13039/501100011033/ and by ERDF A way of making Europe. The support of the
Galician Ministry for Education, Universities and Professional Training and the “ERDF A way of making
Europe” is also acknowledged through grants “Centro de investigación de Galicia accreditation
2024-2027 ED431G-2023/04” and “Reference Competitive Group accreditation 2022-2025 ED431C 2022/19”.
Special thanks are extended to Albina Sarymsakova and Noelia García Díaz for their expert support on
specific phonetic tasks.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this manuscript, the authors used GPT-4, Gemini 2.0 Flash, and DeepL for
text translation, grammar and spelling verification, and paraphrasing and rephrasing. After using
these tools, the authors reviewed and edited the content as needed and take full responsibility for the
publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Set of Sentences Used for Evaluation</title>
      <sec id="sec-9-1">
        <title>Objective Evaluation</title>
        <p>1. Partiu a leña do souto.
2. Vén aquí para velo ben.
3. Que dis? É difícil, non?
4. A cadela desenterrou os ósos.
5. Cala, ho! Non deixas escoitar.</p>
      </sec>
      <sec id="sec-9-2">
        <title>Additional Sentences for Perceptive Test</title>
        <p>6. Unha curuxa! Berrou o vello.
7. Esa muller e o seu home son meus veciños.
8. O proxecto foi nado na internet galaica.
9. Choveu moito onte en Santiago.
10. Centos de veces merquei ese xornal.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fong</surname>
          </string-name>
          , I. Nourbakhsh,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dautenhahn</surname>
          </string-name>
          ,
          <article-title>A survey of socially interactive robots</article-title>
          ,
          <source>Robotics and Autonomous Systems</source>
          <volume>42</volume>
          (
          <year>2003</year>
          )
          <fpage>143</fpage>
          -
          <lpage>166</lpage>
          . doi:10.1016/S0921-8890(02)00372-X.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Henschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Laban</surname>
          </string-name>
          , E. Cross,
          <article-title>What Makes a Robot Social? A Review of Social Robots from Science Fiction to a Home or Hospital Near You</article-title>
          ,
          <source>Current Robotics Reports</source>
          <volume>2</volume>
          (
          <year>2021</year>
          ).
          doi:10.1007/s43154-020-00035-0.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mahdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Akgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saleh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dautenhahn</surname>
          </string-name>
          ,
          <article-title>A survey on the design and evolution of social robots - Past, present and future</article-title>
          ,
          <source>Robotics and Autonomous Systems</source>
          <volume>156</volume>
          (
          <year>2022</year>
          )
          <fpage>104193</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Al Moubayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Beskow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Skantze</surname>
          </string-name>
          ,
          <article-title>The furhat social companion talking head</article-title>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2013</year>
          ,
          <year>2013</year>
          , pp.
          <fpage>747</fpage>
          -
          <lpage>749</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>McAuliffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Socolof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mihuc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sonderegger</surname>
          </string-name>
          ,
          <article-title>Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi</article-title>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2017</year>
          ,
          <year>2017</year>
          , pp.
          <fpage>498</fpage>
          -
          <lpage>502</lpage>
          . doi:10.21437/Interspeech.2017-1386.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Magariños</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Öktem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Moscoso</given-names>
            <surname>Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Vázquez</given-names>
            <surname>Abuín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. García</given-names>
            <surname>Díaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Vladu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Fernández</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baqueiro Vidal</surname>
          </string-name>
          ,
          <article-title>Nós-TTS: a Web User Interface for Galician Text-to-Speech</article-title>
          ,
          <source>in: Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol</source>
          .
          <volume>2</volume>
          , Association for Computational Linguistics, Santiago de Compostela, Galicia, Spain,
          <year>2024</year>
          , pp.
          <fpage>200</fpage>
          -
          <lpage>203</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Vladu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>de Dios-Flores</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Magariños</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Pichel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gamallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Fernández</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bugarín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. González</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Regueira</surname>
          </string-name>
          ,
          <article-title>Proxecto Nós: Artificial intelligence at the service of the Galician language</article-title>
          ,
          <source>in: SEPLN-PD 2022. Annual Conference of the Spanish Association for Natural Language Processing 2022: Projects and Demonstrations</source>
          , A Coruña, Spain,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>de Dios-Flores</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Magariños</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Vladu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Pichel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gamallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fernández Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bugarín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>González González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Regueira</surname>
          </string-name>
          ,
          <article-title>The Nós Project: Opening routes for the Galician language in the field of language technologies</article-title>
          ,
          <source>in: Proceedings of the TDLE Workshop LREC2022</source>
          , European Language Resources Association (ELRA),
          Marseille
          ,
          <year>2022</year>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <source>The HTK Hidden Markov Model Toolkit: Design and Philosophy</source>
          , Entropic Cambridge Research Laboratory, Ltd
          <volume>2</volume>
          (
          <year>1994</year>
          )
          <fpage>2</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Gorman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Howell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <article-title>Prosodylab-aligner: A tool for forced alignment of laboratory speech</article-title>
          ,
          <source>in: Canadian Acoustics</source>
          , volume
          <volume>39</volume>
          ,
          <year>2011</year>
          , pp.
          <fpage>192</fpage>
          -
          <lpage>193</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Schiel</surname>
          </string-name>
          ,
          <article-title>Automatic Phonetic Transcription of Non-Prompted Speech</article-title>
          ,
          <source>in: Proceedings of the 14th International Congress of Phonetic Sciences (ICPhS)</source>
          , volume
          <volume>1</volume>
          ,
          <year>1999</year>
          , pp.
          <fpage>607</fpage>
          -
          <lpage>610</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghoshal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Boulianne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Burget</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Glembek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hannemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Motlicek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Silovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Stemmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Vesely</surname>
          </string-name>
          ,
          <article-title>The Kaldi Speech Recognition Toolkit</article-title>
          ,
          <source>in: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding</source>
          ,
          <source>IEEE Signal Processing Society</source>
          ,
          <year>2011</year>
          . IEEE Catalog No.: CFP11SRW-USB.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <article-title>wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations</article-title>
          ,
          <source>in: Proceedings of the 34th Int. Conf. Neural Information Processing Systems</source>
          , NIPS '20, Curran Associates Inc.,
          Red Hook, NY
          , USA,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks</article-title>
          ,
          <source>in: Proceedings of the 23rd International Conference on Machine Learning</source>
          , ICML '06,
          Association for Computing Machinery, New York, NY, USA,
          <year>2006</year>
          , pp.
          <fpage>369</fpage>
          -
          <lpage>376</lpage>
          . doi:10.1145/1143844.1143891.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Biczysko</surname>
          </string-name>
          ,
          <article-title>Automatic Annotation of Speech: Exploring Boundaries within Forced Alignment for Swedish and Norwegian</article-title>
          , Master's thesis, Uppsala University, Department of Linguistics and Philology
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , D. Jurgens,
          <article-title>Phone-to-Audio Alignment without Text: A Semi-Supervised Approach</article-title>
          ,
          <source>in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>8167</fpage>
          -
          <lpage>8171</lpage>
          . doi:10.1109/ICASSP43922.2022.9746112.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rastorgueva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lavrukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ginsburg</surname>
          </string-name>
          ,
          <article-title>NeMo Forced Aligner and its application to word alignment for subtitle generation</article-title>
          ,
          <source>in: Proc. INTERSPEECH</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>5257</fpage>
          -
          <lpage>5258</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rousso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Keshet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chodroff</surname>
          </string-name>
          ,
          <article-title>Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment</article-title>
          ,
          <source>in: Interspeech</source>
          <year>2024</year>
          ,
          <year>2024</year>
          , pp.
          <fpage>1525</fpage>
          -
          <lpage>1529</lpage>
          . doi:10.21437/Interspeech.2024-429.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          Real Academia Galega
          ,
          <source>Dicionario da Real Academia Galega</source>
          ,
          <year>2024</year>
          . URL: https://academia.gal/dicionario, [Accessed 2024-06-30].
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          Instituto da Lingua Galega
          , Universidade de Santiago de Compostela,
          <source>Dicionario de Pronuncia da Lingua Galega</source>
          ,
          <year>2024</year>
          . URL: http://ilg.usc.es/pronuncia/, [Accessed 2024-06-30].
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ardila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Branson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Henretty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kohler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Meyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saunders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Tyers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <article-title>Common Voice: A Massively-Multilingual Speech Corpus</article-title>
          ,
          <source>in: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC</source>
          <year>2020</year>
          ),
          <year>2020</year>
          , pp.
          <fpage>4211</fpage>
          -
          <lpage>4215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vázquez Abuín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. García</given-names>
            <surname>Díaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Vladu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Magariños</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Vidal</given-names>
            <surname>Miguéns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fernández Rei</surname>
          </string-name>
          , Nos_Celtia-GL:
          <article-title>Galician TTS corpus</article-title>
          ,
          <year>2023</year>
          . doi:10.5281/zenodo.7716958, dataset.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Magariños</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Miguéns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Vladu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. G.</given-names>
            <surname>Díaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Abuín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Couso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bardanca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. F.</given-names>
            <surname>Rei</surname>
          </string-name>
          , Nos_ParlaSpeech-GL:
          <article-title>Galician ASR corpus</article-title>
          ,
          <year>2023</year>
          . doi:10.5281/zenodo.7913218, dataset.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>O.</given-names>
            <surname>Kjartansson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gutkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Butryna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Demirsahin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rivera</surname>
          </string-name>
          ,
          <article-title>Open-Source High Quality Speech Datasets for Basque, Catalan and Galician</article-title>
          ,
          <source>in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)</source>
          , European Language Resources Association (ELRA), Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>C.</given-names>
            <surname>Canosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Francisco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moscoso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. González</given-names>
            <surname>Corbelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Magariños</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Vladu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Fernández</given-names>
            <surname>López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Fernández</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dubert-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Regueira</surname>
          </string-name>
          ,
          <article-title>Nos_RG-Podcast-GL</article-title>
          ,
          <year>2025</year>
          . URL: https://huggingface.co/datasets/proxectonos/Nos_RG-Podcast-GL,
          [Accessed 2024-06-30].
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodríguez Banga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>García-Mateo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Méndez-Pazó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>González-González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Magariños</surname>
          </string-name>
          ,
          <article-title>Cotovía: an open source TTS for Galician and Spanish</article-title>
          ,
          <source>in: VII Jornadas en Tecnología del Habla and III Iberian SLTech Workshop</source>
          , IberSPEECH,
          <year>2012</year>
          , pp.
          <fpage>308</fpage>
          -
          <lpage>315</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>N. García</given-names>
            <surname>Díaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Vázquez</given-names>
            <surname>Abuín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Magariños</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Vladu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Moscoso</given-names>
            <surname>Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fernández Rei</surname>
          </string-name>
          ,
          <article-title>Nos_Celtia-GL: an Open High-Quality Speech Synthesis Resource for Galician</article-title>
          ,
          <source>in: IberSPEECH</source>
          <year>2024</year>
          ,
          <year>2024</year>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>95</lpage>
          . doi:10.21437/IberSPEECH.2024-19.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          International Phonetic Association
          ,
          <source>International Phonetic Alphabet Chart</source>
          ,
          <year>2020</year>
          . URL: https://www.internationalphoneticassociation.org/content/ipa-chart, [Accessed 2024-06-24].
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Boersma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weenink</surname>
          </string-name>
          ,
          <article-title>Praat: doing phonetics by computer</article-title>
          , https://www.fon.hum.uva.nl/praat/,
          <year>1992</year>
          . [Accessed 2024-06-30, version 6.4.13].
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , D. Jurgens,
          <article-title>Phone-to-audio alignment without text: A semi-supervised approach</article-title>
          ,
          <year>2022</year>
          , pp.
          <fpage>8167</fpage>
          -
          <lpage>8171</lpage>
          . doi:10.1109/ICASSP43922.2022.9746112.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>