Proxecto Nós: Artificial intelligence at the service of the Galician language
Proxecto Nós: Inteligencia artificial al servicio de la lengua gallega

Adina Ioana Vladu 1, Iria de-Dios-Flores 2, Carmen Magariños 1, John E. Ortega 2, José Ramom
Pichel 2, Marcos Garcia 2, Pablo Gamallo 2, Elisa Fernández Rei 1, Alberto Bugarín 2, Manuel
González González 1, Senén Barro 2 and Xosé Luis Regueira 1
1 Instituto da Lingua Galega (ILG) - Universidade de Santiago de Compostela, Spain
2 Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS) - Universidade de Santiago de Compostela, Spain


                   Abstract
                   Proxecto Nós is an initiative aimed at providing the Galician language with openly licensed
                   resources, tools, and demonstrators in the area of intelligent technologies. The Project has two
                   main scientific and technological objectives: (i) to integrate the Galician language into cutting-
                   edge AI and language technologies, thus enabling the natural use of Galician in human-machine
                   interactions; and (ii) to improve the state of the art of language technologies for Galician.

                   Keywords
                   Language technologies, linguistic rights, Galician, low-resource languages.


1. Introduction

    Proxecto Nós (The Nós Project) is an initiative promoted by the Galician Government (Xunta de
Galicia), aimed at providing the Galician language with openly licensed resources, tools,
demonstrators, and use cases in the area of intelligent technologies. The execution of Proxecto
Nós has been entrusted to the University of Santiago de Compostela (USC) and is currently being
carried out by a research team comprising members of the Instituto da Lingua Galega (ILG) and
the Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS). The first stage,
spanning from the final quarter of 2021 to 2025, will lay the foundations and provide the
resources that will help place Galician among the languages that are fully active in the
digital society and economy.

SEPLN-PD 2022. Annual Conference of the Spanish Association for
Natural Language Processing 2022: Projects and Demonstrations,
September 21-23, 2022, A Coruña, Spain
EMAIL: adina.vladu@usc.gal (A.I. Vladu); iria.dedios@usc.gal (I. de-Dios-Flores);
mariadelcarmen.magarinos@usc.gal (C. Magariños); john.ortega@usc.gal (J. Ortega);
jramom.pichel@usc.gal (J.R. Pichel); marcos.garcia.gonzalez@usc.gal (M. Garcia);
pablo.gamallo@usc.gal (P. Gamallo); elisa.fernandez@usc.gal (E. Fernández Rei);
alberto.bugarin.diz@usc.gal (A. Bugarín); manuel.gonzalez.gonzalez@usc.gal (M. González González);
senen.barro@usc.gal (S. Barro); xoseluis.regueira@usc.gal (X.L. Regueira)
ORCID: 0000-0002-3910-7820 (A.I. Vladu); 0000-0002-5941-1707 (I. de-Dios-Flores);
0000-0003-3525-1304 (C. Magariños); 0000-0002-2328-3205 (J. Ortega);
0000-0001-5172-6803 (J.R. Pichel); 0000-0002-6557-0210 (M. Garcia);
0000-0002-5819-2469 (P. Gamallo); 0000-0002-4109-0087 (E. Fernández Rei);
0000-0003-3574-3843 (A. Bugarín); 0000-0001-7948-4607 (M. González González);
0000-0001-6035-540X (S. Barro); 0000-0001-7264-3740 (X.L. Regueira)
               ©️ 2020 Copyright for this paper by its authors. Use permitted under Creative
               Commons License Attribution 4.0 International (CC BY 4.0).

               CEUR Workshop Proceedings (CEUR-WS.org)


2. Context and motivation

    The development of language technologies is a strategic innovation area geared towards the
digital society and economy, and it has been a priority in both Spanish (Plan Estatal de
Investigación Científica y Técnica y de Innovación, Estrategia Española de Ciencia y Tecnología
y de Innovación) and European (Horizon 2020) scientific planning. Technologies such as machine
translation (MT), information extraction (IE), text analytics, and dialogue systems are
essential in the digital society, culture, and economy.
    Languages in high demand worldwide (especially English) benefit from a large variety of
computational resources that can contribute to developing new automatic language processing
technologies and tools. Such is the case due to the long-standing research tradition in these
areas (e.g., the variety of projects financed by the USA’s DARPA) and the need to incorporate
such languages into the AI applications associated with the latest electronic devices (such as
the conversational AI or automatic dictation software developed by Google, Amazon or Apple).
Other languages that have joined AI research later, such as Chinese, are currently following in
the footsteps of English, through projects such as Baidu’s Qian Yan, which significantly improve
the computational resources available in their respective language varieties.
    Nevertheless, language technologies are also necessary for languages in lower international
demand. Consequently, different languages have developed initiatives similar to Nós. Among
others, we can highlight Projecte AINA, which will develop computational resources for Catalan
until 2024, or the work carried out at the HiTZ Research Center, focusing on language
technologies for Basque. Other projects, such as CorCenCC (in Great Britain, for Welsh) or
UQAILAUT (in Canada, for Inuktitut), were considered success cases in the promotion of the
digital use of socially threatened languages.
    The democratization of language technologies has a great social and cultural impact on the
communities that use them. For instance, MT increases access to content in different languages,
thus facilitating intercultural relations; dialogue systems allow us to communicate with
machines in our own language; and semantic technologies enable advances in the automatic
comprehension of texts, thus making it possible to process enormous quantities of documents. In
the case of Galician, incorporating the language into state-of-the-art AI applications can not
only significantly favor its prestige (a decisive factor in language normalization), but also
guarantee citizens’ language rights and reduce social inequality.
    In economic terms, the global Natural Language Processing (NLP) market size was valued at
more than USD 10 billion in 2020 and is expected to reach USD 41 billion by 2025 (Aldabe et al.,
2021). NLP technologies are used in different areas such as information retrieval, MT, IE (with
notable growth in its application in the medical domain during the Covid-19 pandemic), dialogue
systems, and automatic text generation, among many others. The capacity to model language, an
essential ability for human beings, ensures a promising future for such technologies from both
an economic and a research and innovation perspective.

3. State of the art: Galician resources and technologies

    In 2012, the White Paper The Galician Language in the Digital Age (García-Mateo et al.,
2012) described Galician as a language with a level of technological support that “gives rise
to cautious optimism”, while highlighting the need for new resources and tools. Previous
research projects on Galician resulted in speech processing resources (COTOVÍA), an annotated
reference corpus (CORGA), morphosyntactic lemmatizers and taggers (XIADA, FreeLing, IXA-Pipes),
other specialized corpora, both text (CLUVI, CTG, TreeGal) and speech (CORILGA, AGO), MT systems
(GAIO, OpenTrad), spellcheckers (OrtoGal), grammar checkers (Avalingua), language analysis and
IE tools (Linguakit), language models (SemantiGal, Bertinho), and other resources.
    Furthermore, Galician is currently part of multilingual crowdsourced data collection
initiatives carried out by important companies on the global IT market, which have resulted in
speech databases such as Google’s SLR77 (Kjartansson et al., 2020) and Mozilla’s CommonVoice 7.0
and 8.0 (Ardila et al., 2020). This situation is reflected in a recent report on the current
state of the LT (Language Technology) field for Galician (Ramírez Sánchez & García Mateo, 2022),
which reported considerable growth in the production of high-quality Galician resources and
services, especially text resources.
    Despite the quality of these resources, it should be noted that not all are freely and
publicly available for the development of LT. The LT field has undergone profound changes over
the last few years since the introduction of neural network systems. Generally, training models
using these state-of-the-art technologies requires large quantities of data and has high energy
and computational costs, which continues to be a challenge for low-resource languages. However,
as many recent studies show, end-to-end technologies and open-source multilingual pre-trained
models created using large quantities of data from high-resource languages (Shen et al., 2018;
Baevski et al., 2020; Wolf et al., 2020) can be used, through transfer learning and fine-tuning,
to train models for low- or medium-resource languages such as Catalan (Külebi & Öktem, 2018;
Külebi et al., 2020) or, in our case, Galician. To this end, the existence of resources and
tools that are freely available to the scientific and business community is essential, and that
constitutes one of the main objectives of Proxecto Nós.

4. Project description

4.1. Organization

    The tasks that are to be carried out as part of the Project can be included in the
following areas, corresponding to some of the major NLP fields:
    1. Speech synthesis (TTS)
    2. Speech recognition (ASR)
    3. Automatic text generation
    4. Dialogue systems
    5. MT
    6. IE
    7. Opinion mining and fact checking
    8. Language correction and assessment
    These broad, mutually interdependent areas fall within the three strategic lines jointly
identified by the Project’s research team and the Xunta de Galicia (in particular, the Axencia
para a Modernización Tecnolóxica de Galicia): (i) spoken or written conversation with people,
(ii) language quality, and (iii) information management.
    In accordance with the funding agreement signed by the Xunta de Galicia and the USC, the
organization of the tasks included in Nós follows a yearly schedule. Each year, resources,
language models and demonstrators from different areas will be made publicly available.
    More information on the organization of Proxecto Nós can be found in de-Dios-Flores et al.
(2022).

4.2. Scientific and technological objectives

    Proxecto Nós has two main scientific and technological objectives: (i) to integrate the
Galician language into cutting-edge AI and language technologies, thus enabling the natural use
of Galician in human-machine interactions; and (ii) to improve the state of the art of language
technologies for Galician.
    For this purpose, resources, tools, and applications will be developed and distributed
under open licenses, which will allow them to be integrated into existing devices and services
(such as smart speakers or conversational agents) and future technologies. To this end,
specific objectives directly related to some of the major tasks of NLP have been established.
    Each of these technological objectives will be executed in a different subproject, which
will allow the parallel development of different tasks and, overall, a more effective
organization of the work. However, a set of general objectives is shared by all the tasks.
These objectives are: (i) the compilation of high-quality linguistic resources (annotated
reference corpora, web-scale corpora, specialized corpora by tasks and domains, parallel
corpora, knowledge bases, dictionaries, etc.); (ii) the elaboration of language and acoustic
models (both general-purpose and task-specific models); and (iii) the development of
applications based on these models. The project will also have a general coordination mechanism
through which resources will be distributed and shared among its subprojects.
    The resources and language models developed for each task will be made available to the
public, thus allowing their use in all kinds of applications, services, and products, by the
scientific community, companies, institutions, and society in general. The results will be
disseminated through a repository available at the project’s web portal (which can be hosted on
internal servers), as well as other established and internationally recognized repositories,
such as HuggingFace, GitHub, Zenodo, etc.
    Finally, the project contemplates the complete development of applications based on these
resources, which will serve as visible and accessible demonstrators of the developed technology
and will drive the development of new products.

5. Conclusion and future work

    Among the initial results of Nós, we can highlight the first crawl of a web-based Galician
corpus and a language model based on the CCNet tools and data (Ortega et al., 2022a), and the
development and testing of a Spanish-Galician neural machine translation (NMT) system prototype
(Ortega et al., 2022b).
    For the current year, Proxecto Nós aims to keep generating linguistic and computational
resources to explore different subprojects. Specifically, in the first half of 2022 work will
be carried out on the design of a high-quality speech corpus of sufficient size to allow
training state-of-the-art TTS models, to be released in the last quarter. The second half of
the year will also see the publication of a speech corpus for ASR. In the same timeframe, the
project will publish several text corpora: parallel Galician-Spanish, Galician-English, and
Galician-Portuguese corpora; a web-scale Galician text corpus, larger than the one already
compiled, to be used in all the subprojects working with written text included in Nós; and a
domain-specific corpus for automatic text generation. Based on these resources, new language
models will be developed using different state-of-the-art techniques, as well as demonstrators
or prototypes of a TTS system, NMT system, and automatic text generator for Galician. At the
same time, throughout 2022 efforts will focus on extending and improving the first systems
developed, and on validating the results obtained via the creation of high-quality gold
standards.

Acknowledgements

    This research was funded by the project “Nós: Galician in the society and economy of
artificial intelligence” (Proxecto Nós: O galego na sociedade e economía da intelixencia
artificial 2021-CP080), an agreement between the Xunta de Galicia and the University of Santiago
de Compostela, and by grant ED431G2019/04 from the Galician Ministry of Education, University
and Professional Training, and the European Regional Development Fund (ERDF/FEDER program).

References

[1] I. Aldabe, G. Rehm, G. Rigau, A. Way, Report on existing strategic documents and projects
     in LT/AI, European Language Equality (ELE), 2021.
[2] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders,
     F. M. Tyers, G. Weber, Common Voice: A Massively-Multilingual Speech Corpus, in:
     Proceedings of LREC 2020.
[3] A. Baevski, H. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A Framework for Self-Supervised
     Learning of Speech Representations, arXiv, 2020, pp. 1–19. doi: 10.48550/arXiv.2006.11477
[4] I. de-Dios-Flores, C. Magariños, A. I. Vladu, J. E. Ortega, J. R. Pichel, M. García, P.
     Gamallo, E. Fernández Rei, A. Bugarín-Diz, M. González González, S. Barro, X. L. Regueira,
     The Nós Project: Opening routes for the Galician language in the field of language
     technologies, in: Proceedings of the TDLE Workshop @LREC2022, Marseille, 20 June 2022,
     pp. 52–61.
[5] C. García Mateo, M. Arza Rodríguez (auth.), G. Rehm, H. Uszkoreit (eds.), The Galician
     Language in the Digital Age, Springer-Verlag, Berlin Heidelberg, 2012.
[6] B. Külebi, A. Öktem, Building an Open Source Automatic Speech Recognition System for
     Catalan, in: IberSPEECH, Barcelona, Spain, 2018, pp. 25–29.
[7] B. Külebi, A. Öktem, A. Peiró-Lilja, S. Pascual, M. Farrús, CATOTRON - A Neural
     Text-To-Speech System in Catalan, in: Proceedings of Interspeech 2020.
[8] O. Kjartansson, A. Gutkin, A. Butryna, I. Demirsahin, C. Rivera, Open-Source High Quality
     Speech Datasets for Basque, Catalan and Galician, in: Proceedings of the 1st Joint Workshop
     on SLTU and CCURL, Marseille, France, 2020, pp. 21–27.
[9] J. E. Ortega, I. de Dios Flores, P. Gamallo, J. R. Pichel, A Neural Machine Translation
     System for Spanish to Galician through Portuguese Transliteration, in: PROPOR 2022,
     Fortaleza, Brazil.
[10] J. E. Ortega, I. de Dios Flores, J. R. Pichel, P. Gamallo, Revisiting CCNet for Quality
     Measurements in Galician, in: PROPOR 2022, Fortaleza, Brazil.
[11] J. M. Ramírez Sánchez, C. García Mateo (auth.), M. Giagkou, S. Piperidis, G. Rehm, J. Dunne
     (eds.), Report on the Galician Language (Deliverable D1.15), ELE, 2022.
[12] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang,
     R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, Y. Wu, Natural TTS Synthesis By
     Conditioning Wavenet On Mel Spectrogram Predictions, in: Proceedings of ICASSP, 2018.
[13] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
     M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T.
     Le Scao, S. Gugger, et al., Transformers: State-of-the-Art Natural Language Processing, in:
     Proceedings of the 2020 Conference on Empirical Methods in NLP: System Demonstrations,
     2020, pp. 38–45.