Proxecto Nós: Artificial intelligence at the service of the Galician language
Proxecto Nós: Inteligencia artificial al servicio de la lengua gallega

Adina Ioana Vladu 1, Iria de-Dios-Flores 2, Carmen Magariños 1, John E. Ortega 2, José Ramom Pichel 2, Marcos Garcia 2, Pablo Gamallo 2, Elisa Fernández Rei 1, Alberto Bugarín 2, Manuel González González 1, Senén Barro 2 and Xosé Luis Regueira 1

1 Instituto da Lingua Galega (ILG) - Universidade de Santiago de Compostela, Spain
2 Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS) - Universidade de Santiago de Compostela, Spain

Abstract
Proxecto Nós is an initiative aimed at providing the Galician language with openly licensed resources, tools, and demonstrators in the area of intelligent technologies. The Project has two main scientific and technological objectives: (i) to integrate the Galician language into cutting-edge AI and language technologies, thus enabling the natural use of Galician in human-machine interactions; and (ii) to improve the state of the art of language technologies for Galician.

Keywords
Language technologies, linguistic rights, Galician, low-resource languages.

SEPLN-PD 2022. Annual Conference of the Spanish Association for Natural Language Processing 2022: Projects and Demonstrations, September 21-23, 2022, A Coruña, Spain
EMAIL: adina.vladu@usc.gal (A.I. Vladu); iria.dedios@usc.gal (I. de-Dios-Flores); mariadelcarmen.magarinos@usc.gal (C. Magariños); john.ortega@usc.gal (J. Ortega); jramom.pichel@usc.gal (J.R. Pichel); marcos.garcia.gonzalez@usc.gal (M. Garcia); pablo.gamallo@usc.gal (P. Gamallo); elisa.fernandez@usc.gal (E. Fernández Rei); alberto.bugarin.diz@usc.gal (A. Bugarín); manuel.gonzalez.gonzalez@usc.gal (M. González González); senen.barro@usc.gal (S. Barro); xoseluis.regueira@usc.gal (X.L. Regueira)
ORCID: 0000-0002-3910-7820 (A.I. Vladu); 0000-0002-5941-1707 (I. de-Dios-Flores); 0000-0003-3525-1304 (C. Magariños); 0000-0002-2328-3205 (J. Ortega); 0000-0001-5172-6803 (J.R. Pichel); 0000-0002-6557-0210 (M. Garcia); 0000-0002-5819-2469 (P. Gamallo); 0000-0002-4109-0087 (E. Fernández Rei); 0000-0003-3574-3843 (A. Bugarín); 0000-0001-7948-4607 (M. González González); 0000-0001-6035-540X (S. Barro); 0000-0001-7264-3740 (X.L. Regueira)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Proxecto Nós (The Nós Project) is an initiative promoted by the Galician Government (Xunta de Galicia), aimed at providing the Galician language with openly licensed resources, tools, demonstrators, and use cases in the area of intelligent technologies. The execution of Proxecto Nós has been entrusted to the University of Santiago de Compostela (USC) and is currently being carried out by a research team comprising members of the Instituto da Lingua Galega (ILG) and the Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS). The first stage, spanning from the final quarter of 2021 to 2025, will lay the foundations and provide the resources that will help place Galician among the languages that are fully active in the digital society and economy.

2. Context and motivation

The development of language technologies is a strategic innovation area geared towards the digital society and economy, and it has been a priority in both Spanish (Plan Estatal de Investigación Científica y Técnica y de Innovación, Estrategia Española de Ciencia y Tecnología y de Innovación) and European (Horizon 2020) scientific planning. Technologies such as machine translation (MT), information extraction (IE), text analytics, and dialogue systems are essential in the digital society, culture, and economy.

Languages in high demand worldwide (especially English) benefit from a large variety of computational resources that can contribute to developing new automatic language processing technologies and tools. This is due to the long-standing research tradition in these areas (e.g., the variety of projects financed by DARPA in the USA) and the need to incorporate such languages into the AI applications associated with the latest electronic devices (such as the conversational AI or automatic dictation software developed by Google, Amazon or Apple). Other languages that have joined AI research later, such as Chinese, are currently following in the footsteps of English, through projects such as Baidu's Qian Yan, which significantly improve the computational resources available in their respective language varieties.

Nevertheless, language technologies are also necessary for languages in lower international demand. Consequently, different languages have developed initiatives similar to Nós. Among others, we can highlight Projecte AINA, which will develop computational resources for Catalan until 2024, or the work carried out at the HiTZ Research Center, focusing on language technologies for Basque. Other projects, such as CorCenCC (in Great Britain, for Welsh) or UQAILAUT (in Canada, for Inuktitut), have been considered success cases in the promotion of the digital use of socially threatened languages.

The democratization of language technologies has a great social and cultural impact on the communities that use them. For instance, MT increases access to content in different languages, thus facilitating intercultural relations; dialogue systems allow us to communicate with machines in our own language; and semantic technologies enable advances in the automatic comprehension of texts, thus making it possible to process enormous quantities of documents. In the case of Galician, incorporating the language into state-of-the-art AI applications can not only significantly favor its prestige (a decisive factor in language normalization), but also guarantee citizens' language rights and reduce social inequality.

In economic terms, the global Natural Language Processing (NLP) market size was valued at more than USD 10 billion in 2020 and is expected to reach USD 41 billion by 2025 (Aldabe et al., 2021). NLP technologies are used in different areas such as information retrieval, MT, IE (with notable growth in its application in the medical domain during the Covid-19 pandemic), dialogue systems, and automatic text generation, among many others. The capacity to model language, an essential ability for human beings, ensures a promising future for such technologies from both an economic and a research and innovation perspective.

3. State of the art: Galician resources and technologies

In 2012, the White Paper The Galician Language in the Digital Age (García-Mateo et al., 2012) described Galician as a language with a level of technological support that "gives rise to cautious optimism", while highlighting the need for new resources and tools. Previous research projects on Galician resulted in speech processing resources (COTOVÍA), an annotated reference corpus (CORGA), morphosyntactic lemmatizers and taggers (XIADA, FreeLing, IXA-Pipes), other specialized corpora, both text (CLUVI, CTG, TreeGal) and speech (CORILGA, AGO), MT systems (GAIO, OpenTrad), spellcheckers (OrtoGal), grammar checkers (Avalingua), language analysis and IE tools (Linguakit), language models (SemantiGal, Bertinho), and other resources.

Furthermore, Galician is currently part of multilingual crowdsourced data collection initiatives carried out by important companies on the global IT market, which have resulted in speech databases such as Google's SLR77 (Kjartansson et al., 2020) and Mozilla's CommonVoice 7.0 and 8.0 (Ardila et al., 2020). This situation is reflected in a recent report on the current state of the LT (Language Technology) field for Galician (Ramírez Sánchez & García Mateo, 2022), which reported considerable growth in the production of high-quality Galician resources and services, especially text resources.

Despite the quality of these resources, it should be noted that not all are freely and publicly available for the development of LT. The LT field has undergone profound changes over the last few years since the introduction of neural network systems. Generally, training models using these state-of-the-art technologies requires large quantities of data and has high energy and computational costs, which continues to be a challenge for low-resource languages. However, as many recent studies show, end-to-end technologies and open-source multilingual pre-trained models created using large quantities of data from high-resource languages (Shen et al., 2018; Baevski et al., 2020; Wolf et al., 2020) can be used, through transfer learning and fine-tuning, to train models in low- or medium-resource languages such as Catalan (Külebi & Öktem, 2018; Külebi et al., 2020) or, in our case, Galician. To this end, the existence of resources and tools that are freely available to the scientific and business community is essential, and that constitutes one of the main objectives of Proxecto Nós.

4. Project description

4.1. Organization

The tasks that are to be carried out as part of the Project can be included in the following areas, corresponding to some of the major NLP fields:
1. Speech synthesis (TTS)
2. Speech recognition (ASR)
3. Automatic text generation
4. Dialogue systems
5. MT
6. IE
7. Opinion mining and fact checking
8. Language correction and assessment

These broad, mutually interdependent areas fall within the three strategic lines jointly identified by the Project's research team and the Xunta de Galicia (in particular, with the Axencia para a Modernización Tecnolóxica de Galicia): (i) spoken or written conversation with people, (ii) language quality, and (iii) information management.

In accordance with the funding agreement signed by the Xunta de Galicia and the USC, the organization of the tasks included in Nós follows a yearly schedule. Each year, resources, language models and demonstrators from different areas will be made publicly available.

More information on the organization of Proxecto Nós can be found in de-Dios-Flores et al. (2022).

4.2. Scientific and technological objectives

Proxecto Nós has two main scientific and technological objectives: (i) to integrate the Galician language into cutting-edge AI and language technologies, thus enabling the natural use of Galician in human-machine interactions; and (ii) to improve the state of the art of language technologies for Galician.

For this purpose, resources, tools, and applications will be developed and distributed under open licenses, which will allow them to be integrated into existing devices and services (such as smart speakers or conversational agents) and future technologies. To this end, specific objectives directly related to some of the major tasks of NLP have been established.

Each of these technological objectives will be executed in a different subproject, which will allow the parallel development of different tasks and, overall, a more effective organization of the work. However, a set of general objectives is shared by all the tasks. These objectives are: (i) the compilation of high-quality linguistic resources (annotated reference corpora, web-scale corpora, specialized corpora by tasks and domains, parallel corpora, knowledge bases, dictionaries, etc.); (ii) the elaboration of language and acoustic models (both general-purpose and task-specific models); and (iii) the development of applications based on these models. The project will also have a general coordination mechanism through which resources will be distributed and shared among its subprojects.

The resources and language models developed for each task will be made available to the public, thus allowing their use in all kinds of applications, services, and products, by the scientific community, companies, institutions, and society in general. The results will be disseminated through a repository available at the project's web portal (which can be hosted on internal servers), as well as other established and internationally recognized repositories, such as HuggingFace, GitHub, Zenodo, etc.

Finally, the project envisages the complete development of applications based on these resources, which will act as visible and accessible demonstrators of the developed technology and will act as a driver for the development of new products.

5. Conclusion and future work

Among the initial results of Nós, we can highlight the first crawl of a web-based Galician corpus and a language model based on the CCNet tools and data (Ortega et al., 2022a), and the development and testing of a Spanish-Galician neural machine translation (NMT) system prototype (Ortega et al., 2022b).

For the current year, Proxecto Nós aims to keep generating linguistic and computational resources to explore different subprojects. Specifically, in the first half of 2022 work will be carried out on the design of a high-quality speech corpus large enough to train state-of-the-art TTS models, to be released in the last quarter. The second half of the year will also see the publication of a speech corpus for ASR. In the same timeframe, the project will publish several text corpora: parallel Galician-Spanish, Galician-English, and Galician-Portuguese corpora; a web-scale Galician text corpus, larger than the one already compiled, to be used in all the subprojects working with written text included in Nós; and a domain-specific corpus for automatic text generation. Based on these resources, new language models will be developed using different state-of-the-art techniques, as well as demonstrators or prototypes of a TTS system, NMT system, and automatic text generator for Galician. At the same time, throughout 2022 efforts will focus on extending and improving the first systems developed, and on validating the results obtained via the creation of high-quality gold standards.

Acknowledgements

This research was funded by the project "Nós: Galician in the society and economy of artificial intelligence" (Proxecto Nós: O galego na sociedade e economía da intelixencia artificial 2021-CP080), agreement between Xunta de Galicia and University of Santiago de Compostela, and grant ED431G2019/04 by the Galician Ministry of Education, University and Professional Training, and the European Regional Development Fund (ERDF/FEDER program).

References

[1] I. Aldabe, G. Rehm, G. Rigau, A. Way, Report on existing strategic documents and projects in LT/AI, European Language Equality (ELE), 2021.
[2] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, G. Weber, Common Voice: A Massively-Multilingual Speech Corpus, in: Proceedings of LREC 2020.
[3] A. Baevski, H. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, arXiv, 2020, pp. 1–19. doi: 10.48550/arXiv.2006.11477
[4] I. de-Dios-Flores, C. Magariños, A. I. Vladu, J. E. Ortega, J. R. Pichel, M. García, P. Gamallo, E. Fernández Rei, A. Bugarín-Diz, M. González González, S. Barro, X. L. Regueira, The Nós Project: Opening routes for the Galician language in the field of language technologies, in: Proceedings of the TDLE Workshop @LREC2022, Marseille, 20 June 2022, pp. 52–61.
[5] C. García Mateo, M. Arza Rodríguez (auth.), G. Rehm, H. Uszkoreit (eds.), The Galician Language in the Digital Age, Springer-Verlag, Berlin Heidelberg, 2012.
[6] B. Külebi, A. Öktem, Building an Open Source Automatic Speech Recognition System for Catalan, in: IberSPEECH, Barcelona, Spain, 2018, pp. 25–29.
[7] B. Külebi, A. Öktem, A. Peiró-Lilja, S. Pascual, M. Farrús, CATOTRON - A Neural Text-To-Speech System in Catalan, in: Proceedings of Interspeech 2020.
[8] O. Kjartansson, A. Gutkin, A. Butryna, I. Demirsahin, C. Rivera, Open-Source High Quality Speech Datasets for Basque, Catalan and Galician, in: Proceedings of the 1st Joint Workshop on SLTU and CCURL, Marseille, France, 2020, pp. 21–27.
[9] J. E. Ortega, I. de Dios Flores, P. Gamallo, J. R. Pichel, A Neural Machine Translation System for Spanish to Galician through Portuguese Transliteration, in: PROPOR 2022, Fortaleza, Brazil.
[10] J. E. Ortega, I. de Dios Flores, J. R. Pichel, P. Gamallo, Revisiting CCNet for Quality Measurements in Galician, in: PROPOR 2022, Fortaleza, Brazil.
[11] J. M. Ramírez Sánchez, C. García Mateo (auth.), M. Giagkou, S. Piperidis, G. Rehm, J. Dunne (eds.), Report on the Galician Language (Deliverable D1.15), ELE, 2022.
[12] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, Y. Wu, Natural TTS Synthesis By Conditioning Wavenet On Mel Spectrogram Predictions, in: Proceedings of ICASSP, 2018.
[13] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, et al., Transformers: State-of-the-Art Natural Language Processing, in: Proceedings of the 2020 Conference on Empirical Methods in NLP: System Demonstrations, 2020, pp. 38–45.